Calculate TF-IDF using DuckDB
This SQL query calculates the term frequency-inverse document frequency (TF-IDF) scores of words in a set of 4 documents represented by the sentences CTE.
Loading editor...
WITH sentences(sentence, doc) AS ( VALUES ('Air quality in the sunny island improved gradually throughout Wednesday.', 1), ('Air quality in Singapore on Wednesday continued to get worse as haze hit the island.', 2), ('The air quality in Singapore is monitored through a network of air monitoring stations located in different parts of the island', 3), ('The air quality in Singapore got worse on Wednesday.', 4) ), base_tokenized AS ( SELECT UNNEST(STRING_SPLIT(REGEXP_REPLACE(LOWER(sentence),'[^-0-9A-Za-z ]','', 'g'),' ')) AS token, doc FROM sentences ), base_stopworded AS ( SELECT * FROM base_tokenized WHERE token NOT IN ('the', 'on', 'was', 'in', 'got', 'of', 'to', 'as', 'is', 'a') ), tf_num AS ( SELECT token, doc, COUNT(*) AS doc_tok_cnt FROM base_stopworded GROUP BY 1,2 ), tf_den AS ( SELECT doc, SUM(doc_tok_cnt) AS doc_tot_sum FROM tf_num GROUP BY 1 ), idf_den AS ( SELECT COUNT(DISTINCT doc) AS ndocs FROM base_stopworded ), idf_num AS ( SELECT token, COUNT(DISTINCT doc) AS docs FROM base_stopworded GROUP BY 1 ), calc_tf AS ( SELECT token, doc, tf_num.doc_tok_cnt, tf_den.doc_tot_sum, tf_num.doc_tok_cnt/tf_den.doc_tot_sum AS tf FROM tf_num JOIN tf_den USING (doc) ), calc_idf AS ( SELECT token, idf_den.ndocs, idf_num.docs, LOG(idf_den.ndocs/idf_num.docs) AS idf FROM idf_den, idf_num ), calc_tfidf AS ( SELECT *, calc_tf.tf * calc_idf.idf AS tfidf FROM calc_tf JOIN calc_idf USING (token) ) SELECT * FROM calc_tfidf ORDER BY doc ASC, tfidf DESC
It performs the following operations:
- Tokenize the lowercase sentences into individual words by splitting on spaces, after removing all characters that are not letters or numbers.
- Remove stop words (“the”, “on”, “was”, “in”, “got”, “of”, “to”, “as”, “is”, “a”).
- Calculate term frequency (TF) as the number of times a word appears in a document divided by the total number of words in that document.
- Calculate inverse document frequency (IDF) as the logarithm of the total number of documents divided by the number of documents containing the word.
- Multiply TF and IDF to get the TF-IDF score for each word in each document.
- Return a table of the words, their TF-IDF scores, and the document they appear in, sorted by document and then by TF-IDF score in descending order.
CTEs Breakdown
sentences
: Creates a table with 4 sentences representing the 4 documents. Each row contains a sentence and the document number it belongs to.
base_tokenized
: Tokenizes each sentence into individual words. It first removes all characters from the sentence that are not letters or numbers, and then splits the sentence into words by spaces. The result is a table with columns for the word (token) and the document number it belongs to.
base_stopworded
: Removes stop words from the base_tokenized
table. It only includes words that are not in the list of stop words.
tf_num
: Calculates the number of times each word appears in each document. It groups the words and documents from base_stopworded
and counts the number of occurrences of each word in each document.
tf_den
: Calculates the total number of words in each document. It sums the word counts from tf_num
for each document.
idf_den
: Calculates the total number of documents.
idf_num
: Calculates the number of documents containing each word. It groups the words from base_stopworded
and counts the number of unique documents for each word.
calc_tf
: Calculates the term frequency (TF) for each word in each document. It joins the tf_num
and tf_den
CTEs to get the necessary information and calculates TF as the number of times a word appears in a document divided by the total number of words in that document.
calc_idf
: Calculates the inverse document frequency (IDF) for each word. It joins the idf_den
and idf_num
CTEs to get the necessary information and calculates IDF as the logarithm of the total number of documents divided by the number of documents containing the word.
calc_tfidf
: Calculates the TF-IDF score for each word in each document. It joins the calc_tf
and calc_idf
CTEs to get both TF and IDF, and multiplies them to get the TF-IDF score.
TF-IDF example with a larger dataset
Now, lets try with 50,000 HackerNews titles:
Loading editor...
WITH sentences AS ( SELECT title AS sentence, id AS doc FROM 'https://storage.sekuel.com/hn_titles.parquet' -- LIMIT 1000 ), base_tokenized AS ( SELECT UNNEST(STRING_SPLIT(REGEXP_REPLACE(LOWER(sentence),'[^-0-9A-Za-z ]','', 'g'),' ')) AS token, doc FROM sentences ), base_stopworded AS ( SELECT * FROM base_tokenized WHERE token NOT IN ('the', 'on', 'was', 'in', 'got', 'of', 'to', 'as', 'is', 'a') ), tf_num AS ( SELECT token, doc, COUNT(*) AS doc_tok_cnt FROM base_stopworded GROUP BY 1,2 ), tf_den AS ( SELECT doc, SUM(doc_tok_cnt) AS doc_tot_sum FROM tf_num GROUP BY 1 ), idf_den AS ( SELECT COUNT(DISTINCT doc) AS ndocs FROM base_stopworded ), idf_num AS ( SELECT token, COUNT(DISTINCT doc) AS docs FROM base_stopworded GROUP BY 1 ), calc_tf AS ( SELECT token, doc, tf_num.doc_tok_cnt, tf_den.doc_tot_sum, tf_num.doc_tok_cnt/tf_den.doc_tot_sum AS tf FROM tf_num JOIN tf_den USING (doc) ), calc_idf AS ( SELECT token, idf_den.ndocs, idf_num.docs, LOG(idf_den.ndocs/idf_num.docs) AS idf FROM idf_den, idf_num ), calc_tfidf AS ( SELECT *, calc_tf.tf * calc_idf.idf AS tfidf FROM calc_tf JOIN calc_idf USING (token) ) SELECT * FROM calc_tfidf ORDER BY doc ASC, tfidf DESC