Understand TF-IDF in Python
What you will learn
- Learn the concept behind TF-IDF;
- Be able to create a function that calculates TF-IDF;
- Learn to interpret TF-IDF scores;
- Be comfortable with visualizing your results.
Table of Contents
- Introduction
- Data source
- Coding the past: identifying relevant words in historical documents
- Conclusions
Introduction
“Changes are shifting outside the words.”
Annie Lennox
In the Text Data Visualization lesson we learned how to use term frequency to identify the most frequent words in a document. Nevertheless, this methodology focuses only on the prevalence of individual terms within a single document, neglecting the term’s occurrence across other documents within the corpus. As a countermeasure to this shortcoming, TF-IDF was introduced. TF-IDF stands for Term Frequency - Inverse Document Frequency. This numerical statistic serves to underscore the significance of a word within a document, relative to a broader collection or corpus. In this lesson, we’ll journey through the process of calculating TF-IDF scores and bring those abstract numbers to life through Python-based visualizations.
Data source
The data used in this lesson is available on the Oxford Text Archive website. It consists of a collection of pamphlets published between 1750 and 1776 by influential authors in the British colonies. These pieces depict the debate with England over constitutional rights, showing the colonists’ understanding of their contemporary events and the conditions that precipitated the American Revolution. In this lesson, we will focus on the pamphlets of Oxenbridge Thacher, James Otis, and James Mayhew. To know more about textual data sources, check this post: ‘Where to find and how to load historical data’
Coding the past: identifying relevant words in historical documents
1. TF-IDF formula
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical measure that indicates the importance of a word in a document taking into account how frequent the word is in other documents in the same corpus. It consists of multiplying the term frequency (TF) by the inverse document frequency (IDF), which is the logarithm of the total number of documents divided by the number of documents containing the term. The formula is as follows:
Where:
- \(w_{ij}\) is the tf-idf weight for word \(i\) in document \(j\);
- \(tf_{ij}\) is the number of times word \(i\) appears in document \(j\) divided by the total number of words in document \(j\);
- \(N\) is the total number of documents in the corpus;
- \(df_i\) is the number of documents in the corpus that contain word \(i\).
2. TF-IDF calculation example
Suppose you want to calculate the TF-IDF weight for the word “British”, which appears 5 times in a document containing 100 words. Given a corpus containing 4 documents, with 2 documents mentioning the word “British”, TF-IDF can be calculated by:
\[w_{British} = \frac{5}{100} * log(\frac{4}{2}) = 0.015\]TF-IDF increases as the term frequency increases, but it decreases as the number of times the word appears in other documents in the corpus increases. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.
3. Preprocessing text data in Python
We will be using the same functions from the lesson Visualizing Text Data to preprocess the text data. The functions are:
load_text
: loads the text from a txt file and returns a list of words;prepare_text
: preprocesses the text by removing stopwords, removing words with less than 3 characters, and transforming all words to lower case;count_freq
: counts the frequency of each word in the the document and returns a dataframe with the results.
content_copy Copy
After loading the functions above, we can use them to preprocess the text data.
Now we load the manifests of three authors: Oxenbridge Thacher, James Otis, and James Mayhew. The results are stored in three lists called thacher
, otis
, and mayhew
. After that, we preprocess the text data using the function prepare_text
and count the frequency of each word in each document using the function count_freq
. Results are stored in three dataframes called thacher_df
, otis_df
, and mayhew_df
.
content_copy Copy
3. Calculating term frequencies for each document
As we have seen, the first component to calculate the TF-IDF weight is the term frequency. We can calculate the term frequency for each document by dividing the number of times a word appears in the document by the total number of words in it.
content_copy Copy
4. Calculating how many times a word appears in each document in the corpus
To calculate \(df_i\), we will left join all our dataframes. Then, for each row (word) we will sum the number of times its term frequency is greater than zero. In other words, we will count in how many documents the word appears at least once. This value will vary from 1 to 3, since we have three documents in our corpus. This count can be made with the pandas method ne()
that checks if the values in the columns specified are not equal (ne) to zero. The results are booleans that can be summed to get the number of documents in which the word appears.
content_copy Copy
5. TF-IDF calculation
Finally we have all elements to calculate the TF-IDF weight. Note that we will be using base 10 logarithms. To calculate the logarithm, we can use the library math
and its method log10()
. We use the apply()
method to apply the logarithm to each row of dfi and then multiply it by the term frequency. The results are stored in three new columns called TF-IDF_thacher
, TF-IDF_otis
, and TF-IDF_mayhew
. Note that dfi varies from 1 to 3, so when the word appears in all three documents, the logarithm element will be zero and consequently TF-IDF will be zero as well (\(log_{10} 1 = 0\)).
content_copy Copy
5. TF-IDF Visualization
Now we can compare how the two methods define the 10 most important words in each document. Keep in mind that the term frequency does not account for the words in other documents of the corpus while TF-IDF does. TF logic is that the most important words are the ones that appear the most in the document. TF-IDF logic is that the most important words are the ones that appear the most in the document but not in the other documents of the corpus. TF-IDF is more sophisticated because it helps you to distinguish one document from the others.
content_copy Copy
content_copy Copy
content_copy Copy
Please note that common words in the corpus, such as “Britain” and “government,” simply do not appear in the top 10 chart because they are present in all three documents. The intuition behind this is that these words are so common in the corpus that they do not provide much useful information. The top 10 words found with TF-IDF have a stronger explanatory power in distinguishing between the three authors. Furthermore, despite being present in two documents, the word “colonies” continues to have a strong TF-IDF score. This is because the word is not present in all three documents, and it is very common in the two documents where it appears.
Conclusions
- TF-IDF depends on two factors: the frequency of a word in a document, and the inverse frequency of the word in the corpus;
- It is possible to calculate TF-IDF scores from scratch in Python, which helps you to understand the logic behind the calculation;
- TF-IDF is a useful tool when you want to identify words that are specific to a particular document.
Comments
There are currently no comments on this article, be the first to add one below
Add a Comment
If you are looking for a response to your comment, either leave your email address or check back on this page periodically.