Calculate TF-IDF using DuckDB

Updated: 2023-11-24

This SQL query calculates the term frequency-inverse document frequency (TF-IDF) scores of words in a set of 4 documents represented by the sentences CTE.

Loading editor...

WITH 
sentences(sentence, doc) AS (
    VALUES
        ('Air quality in the sunny island improved gradually throughout Wednesday.', 1),
        ('Air quality in Singapore on Wednesday continued to get worse as haze hit the island.', 2),
        ('The air quality in Singapore is monitored through a network of air monitoring stations located in different parts of the island', 3),
        ('The air quality in Singapore got worse on Wednesday.', 4)
), base_tokenized AS (
    SELECT 
        UNNEST(STRING_SPLIT(REGEXP_REPLACE(LOWER(sentence),'[^-0-9A-Za-z ]','', 'g'),' ')) AS token,
        doc
    FROM sentences
), base_stopworded AS (
    SELECT *
    FROM base_tokenized
    WHERE token NOT IN ('the', 'on', 'was', 'in', 'got', 'of', 'to', 'as', 'is', 'a')
), tf_num AS (
    SELECT 
        token,
        doc, 
        COUNT(*) AS doc_tok_cnt
    FROM base_stopworded
    GROUP BY 1,2
), tf_den AS (
    SELECT 
        doc,
        SUM(doc_tok_cnt) AS doc_tot_sum
    FROM tf_num
    GROUP BY 1
), idf_den AS (
    SELECT COUNT(DISTINCT doc) AS ndocs
    FROM base_stopworded
), idf_num AS (
    SELECT 
        token,
        COUNT(DISTINCT doc) AS docs
    FROM
    base_stopworded
  GROUP BY 1
), calc_tf AS (
    SELECT 
        token,
        doc,
        tf_num.doc_tok_cnt,
        tf_den.doc_tot_sum,
        tf_num.doc_tok_cnt/tf_den.doc_tot_sum AS tf
    FROM tf_num
    JOIN tf_den USING (doc)
), calc_idf AS (
    SELECT
        token,
        idf_den.ndocs,
        idf_num.docs,
        LOG(idf_den.ndocs/idf_num.docs) AS idf
    FROM idf_den, idf_num
), calc_tfidf AS (
    SELECT 
        *,
        calc_tf.tf * calc_idf.idf AS tfidf 
    FROM calc_tf
    JOIN calc_idf USING (token)
)
SELECT * 
FROM calc_tfidf 
ORDER BY 
  doc ASC, 
  tfidf DESC

It performs the following operations:

Tokenize the lowercase sentences into individual words by splitting on spaces, after removing all characters that are not letters or numbers.
Remove stop words (“the”, “on”, “was”, “in”, “got”, “of”, “to”, “as”, “is”, “a”).
Calculate term frequency (TF) as the number of times a word appears in a document divided by the total number of words in that document.
Calculate inverse document frequency (IDF) as the logarithm of the total number of documents divided by the number of documents containing the word.
Multiply TF and IDF to get the TF-IDF score for each word in each document.
Return a table of the words, their TF-IDF scores, and the document they appear in, sorted by document and then by TF-IDF score in descending order.

CTEs Breakdown

TF-IDF Query CTE Lineage

sentences: Creates a table with 4 sentences representing the 4 documents. Each row contains a sentence and the document number it belongs to.

base_tokenized: Tokenizes each sentence into individual words. It first removes all characters from the sentence that are not letters or numbers, and then splits the sentence into words by spaces. The result is a table with columns for the word (token) and the document number it belongs to.

base_stopworded: Removes stop words from the base_tokenized table. It only includes words that are not in the list of stop words.

tf_num: Calculates the number of times each word appears in each document. It groups the words and documents from base_stopworded and counts the number of occurrences of each word in each document.

tf_den: Calculates the total number of words in each document. It sums the word counts from tf_num for each document.

idf_den: Calculates the total number of documents.

idf_num: Calculates the number of documents containing each word. It groups the words from base_stopworded and counts the number of unique documents for each word.

calc_tf: Calculates the term frequency (TF) for each word in each document. It joins the tf_num and tf_den CTEs to get the necessary information and calculates TF as the number of times a word appears in a document divided by the total number of words in that document.

calc_idf: Calculates the inverse document frequency (IDF) for each word. It joins the idf_den and idf_num CTEs to get the necessary information and calculates IDF as the logarithm of the total number of documents divided by the number of documents containing the word.

calc_tfidf: Calculates the TF-IDF score for each word in each document. It joins the calc_tf and calc_idf CTEs to get both TF and IDF, and multiplies them to get the TF-IDF score.

TF-IDF example with a larger dataset

Now, lets try with 50,000 HackerNews titles:

Loading editor...

WITH 
sentences AS (
    SELECT 
        title AS sentence, 
        id AS doc 
    FROM 'https://storage.sekuel.com/hn_titles.parquet'
  -- LIMIT 1000
), base_tokenized AS (
    SELECT 
        UNNEST(STRING_SPLIT(REGEXP_REPLACE(LOWER(sentence),'[^-0-9A-Za-z ]','', 'g'),' ')) AS token,
        doc
    FROM sentences
), base_stopworded AS (
    SELECT *
    FROM base_tokenized
    WHERE token NOT IN ('the', 'on', 'was', 'in', 'got', 'of', 'to', 'as', 'is', 'a')
), tf_num AS (
    SELECT 
        token,
        doc, 
        COUNT(*) AS doc_tok_cnt
    FROM base_stopworded
    GROUP BY 1,2
), tf_den AS (
    SELECT 
        doc,
        SUM(doc_tok_cnt) AS doc_tot_sum
    FROM tf_num
    GROUP BY 1
), idf_den AS (
    SELECT COUNT(DISTINCT doc) AS ndocs
    FROM base_stopworded
), idf_num AS (
    SELECT 
        token,
        COUNT(DISTINCT doc) AS docs
    FROM
    base_stopworded
  GROUP BY 1
), calc_tf AS (
    SELECT 
        token,
        doc,
        tf_num.doc_tok_cnt,
        tf_den.doc_tot_sum,
        tf_num.doc_tok_cnt/tf_den.doc_tot_sum AS tf
    FROM tf_num
    JOIN tf_den USING (doc)
), calc_idf AS (
    SELECT
        token,
        idf_den.ndocs,
        idf_num.docs,
        LOG(idf_den.ndocs/idf_num.docs) AS idf
    FROM idf_den, idf_num
), calc_tfidf AS (
    SELECT 
        *,
        calc_tf.tf * calc_idf.idf AS tfidf 
    FROM calc_tf
    JOIN calc_idf USING (token)
)
SELECT * 
FROM calc_tfidf 
ORDER BY 
  doc ASC, 
  tfidf DESC