Text Data Visualization
What you will learn
- Load and tokenize text data in Python;
- Clean your data to retain only the relevant information;
- Count words in a list;
- Visualize text data.
Introduction
“Words have no power to impress the mind without the exquisite horror of their reality.”
Edgar Allan Poe
Have you ever found yourself submerged in text data, your eyes scanning countless words as you try to extract meaningful insights for your research? Text data visualization could be the solution you’re seeking. In our modern world, textual data, be it from historical documents or the latest tweets, has become a deep well of knowledge just waiting to be discovered.
Whether you’re tracing societal trends over time or studying the latest social media topics, analyzing and visualizing text data can be a gold mine. In this lesson, we’ll guide you through this rich universe of words. Using the Natural Language Toolkit (NLTK) and the Matplotlib library, we’ll explore strategies for text data visualization and analysis that can open new angles for your research.
Data source
The data used in this lesson is available on the Oxford Text Archive website. It consists of a collection of pamphlets published between 1750 and 1776 by influential authors in the British colonies. These pieces document the debate with England over constitutional rights, showing how the colonists understood the events of their time and the conditions that precipitated the American Revolution. In this lesson, we will focus on the pamphlets of Oxenbridge Thacher, James Otis, and James Mayhew. To learn more about textual data sources, check this post: ‘Where to find and how to load historical data’.
Coding the past: text data visualization
1. Import a text file into Python
To load text files in Python and reuse our code, we can build a function. Before writing the function, we load all the libraries needed for this lesson.
Using the with statement ensures that the opened file is closed when the block inside it finishes. Note that we use “latin-1” encoding. The function islice() returns an iterator over a selected range of lines in the file, and a for loop appends each of those lines to the list my_text.
word_tokenize is a function from the NLTK library that splits text into tokens (words and punctuation marks). It is applied to each line, and the resulting words are stored in a list. Because tokenizing each line separately yields a list of lists, the result has to be flattened into a single list. This is done with a list comprehension.
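The steps described above could be sketched roughly as follows. This is not the lesson's original code: the function name load_text and the parameters path, start, stop, and tokenizer are assumptions made for illustration. The tokenizer parameter exists only so the sketch can be tried without NLTK's tokenizer data; left at its default, it falls back to word_tokenize.

```python
from itertools import islice


def load_text(path, start=0, stop=None, tokenizer=None):
    """Read lines [start, stop) of a text file and return a flat list of tokens.

    All names here are illustrative assumptions, not the lesson's own code.
    """
    if tokenizer is None:
        # Default to NLTK's tokenizer. It needs tokenizer data that is fetched
        # once with nltk.download("punkt") or, in newer NLTK releases,
        # nltk.download("punkt_tab").
        from nltk.tokenize import word_tokenize
        tokenizer = word_tokenize

    my_text = []
    # "with" closes the file as soon as the block ends.
    with open(path, encoding="latin-1") as handle:
        # islice yields only the requested range of lines.
        for line in islice(handle, start, stop):
            my_text.append(line)

    # Tokenizing line by line gives a list of lists ...
    tokenized = [tokenizer(line) for line in my_text]
    # ... which a nested list comprehension flattens into a single list.
    return [token for line_tokens in tokenized for token in line_tokens]
```

The start and stop arguments are useful for skipping any header or footer lines in the downloaded file; the right values depend on the file you are working with.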
Now we load the pamphlets of three authors: Oxenbridge Thacher, James Otis, and James Mayhew. The results are stored in three lists called thacher, otis, and mayhew.
If you check the length of each list, you will see that Oxenbridge Thacher’s pamphlet has approximately 4,156 words; James Mayhew’s, 18,969 words; and James Otis’s, 34,031 words.
2. Understand NLTK stopwords
Next, we write a function that uses the NLTK stopword list to remove words that add no meaning to our analysis. The function also converts all characters to lowercase and removes every word with two or fewer characters.
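A minimal sketch of such a cleaning function is shown here, assuming the tokens are already in a Python list. The name clean_words and the optional stop_words parameter are illustrative assumptions; when stop_words is omitted, the sketch loads the English list from NLTK, which requires a one-time nltk.download("stopwords").

```python
def clean_words(words, stop_words=None):
    """Lowercase tokens, then drop stopwords and tokens of two or fewer characters.

    The function name and the stop_words parameter are assumptions for illustration.
    """
    if stop_words is None:
        # English stopword list shipped with NLTK
        # (requires nltk.download("stopwords") once).
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words("english"))

    # Lowercase first so that "The" and "the" are treated alike.
    lowered = [word.lower() for word in words]
    # Keep only tokens that are not stopwords and have more than two characters.
    return [word for word in lowered if word not in stop_words and len(word) > 2]


# Example with a small hand-written stopword set (made up for the demo).
demo = clean_words(["The", "Colonies", "of", "GOD"], stop_words={"the", "of"})
```

Lowercasing before filtering matters because the NLTK stopword list is all lowercase, so a capitalized "The" would otherwise slip through.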
We apply the function to the three lists of words. After the cleaning process, the number of words is reduced to less than 50% of the original size.
3. Word counter in Python
Next, a function counts the frequency of each word and returns a dataframe with the words and their frequencies, sorted by frequency in descending order.
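One way to write such a counter is sketched here with collections.Counter and pandas. The function name count_words and the column name count are assumptions; the word column matches the key used later in the lesson.

```python
from collections import Counter

import pandas as pd


def count_words(words):
    """Return a DataFrame of words and their counts, most frequent first.

    The function name and the "count" column name are illustrative assumptions.
    """
    counts = Counter(words)
    # most_common() already returns (word, count) pairs sorted by count, descending.
    return pd.DataFrame(counts.most_common(), columns=["word", "count"])


# Example with a tiny made-up token list.
demo_counts = count_words(["god", "colonies", "god", "rights", "colonies", "god"])
```

Counter does the counting in a single pass, and most_common() saves a separate sorting step.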
4. Word count visualization
We will use the matplotlib library to create a bar plot with the 10 most frequent words in each pamphlet. We use iloc to select the first 10 rows of each dataframe, and barh creates a horizontal bar plot with the words on the y-axis and their frequencies on the x-axis. After that, we set the title of each plot and make a series of adjustments: removing the grid, hiding part of the frame, and changing the font and background colors. Finally, we call tight_layout to adjust the spacing between the plots.
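The plotting steps above might look roughly like this. The function name plot_top_words, the colors, and the toy data are assumptions made for the sketch, and the frequency tables are assumed to have word and count columns. Inverting the y-axis so that the most frequent word appears at the top is an extra choice not mentioned in the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_words(frames, n=10, background="#2E3031", text_color="white"):
    """Draw one horizontal bar chart per entry of `frames` (a dict: title -> DataFrame).

    Function name, colors, and column names are illustrative assumptions.
    """
    fig, axes = plt.subplots(1, len(frames), figsize=(5 * len(frames), 4), squeeze=False)
    fig.patch.set_facecolor(background)
    for ax, (title, df) in zip(axes[0], frames.items()):
        top = df.iloc[:n]                      # first n rows = n most frequent words
        ax.barh(top["word"], top["count"], color="#FF6885")
        ax.invert_yaxis()                      # most frequent word at the top (extra choice)
        ax.set_title(title, color=text_color)
        ax.set_facecolor(background)
        ax.tick_params(colors=text_color)      # tick and label color
        ax.grid(False)                         # no grid
        for side in ("top", "right"):          # hide part of the frame
            ax.spines[side].set_visible(False)
    fig.tight_layout()                         # adjust spacing between the plots
    return fig


# Toy frequency table with invented counts, only to exercise the function.
toy = pd.DataFrame({
    "word": ["colonies", "rights", "parliament", "trade", "duties", "taxes",
             "liberty", "crown", "subjects", "charter", "province", "commerce"],
    "count": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
})
fig = plot_top_words({"Toy data": toy})
```

To save the result, fig.savefig with a file name of your choice writes the figure to disk; in a notebook the figure is displayed directly once the Agg line is removed.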
5. Calculate the proportion of each word and compare the pamphlets
Finally, we calculate the proportion of each word in each pamphlet relative to the total number of words in that document and store it in a new column called “proportion”. We also create two new data frames, one for each pair of pamphlets: one to compare Thacher and Otis, and the other to compare Thacher and Mayhew. This is done with an outer join, using the word column as the key. The join keeps all the words, including those that appear in only one of the two pamphlets, and the resulting missing values are filled with 0.
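A small sketch of the proportion column and the outer join, using pandas. The helper names add_proportion and compare, the count column, and the suffixes are assumptions; the toy counts are invented for the demo.

```python
import pandas as pd


def add_proportion(df):
    """Return a copy of df with each word's share of the total word count."""
    out = df.copy()
    out["proportion"] = out["count"] / out["count"].sum()
    return out


def compare(df_x, df_y, suffixes=("_x", "_y")):
    """Outer-join two frequency tables on "word"; words missing on one side get 0."""
    merged = pd.merge(
        df_x[["word", "proportion"]],
        df_y[["word", "proportion"]],
        on="word",
        how="outer",          # keep words that appear in only one of the tables
        suffixes=suffixes,
    )
    # The join leaves NaN where a word is absent; replace those with 0.
    return merged.fillna(0)


# Toy tables with invented counts.
thacher_toy = add_proportion(pd.DataFrame({"word": ["colonies", "rights"], "count": [3, 1]}))
otis_toy = add_proportion(pd.DataFrame({"word": ["rights", "god"], "count": [1, 1]}))
thacher_otis = compare(thacher_toy, otis_toy, suffixes=("_thacher", "_otis"))
```

In the toy result, "god" gets a proportion of 0 on the Thacher side and "colonies" gets 0 on the Otis side, which is the behavior described above.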
Now we will compare the pamphlets in pairs by plotting the proportion of each word in Thacher on the x-axis and the proportion of the same word in Otis on the y-axis. We will use the scatter function to create a scatter plot in which the coordinates of each point are the proportions of a given word in Thacher and Otis. We will also use the annotate function to label each point with its word. The same procedure will be used to compare Thacher and Mayhew. Note that the more similar two pamphlets are, the more points will be concentrated along the diagonal line (same proportion in both pamphlets).
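The comparison plot could be sketched as below. The function name plot_comparison, the column names, and the toy proportions are assumptions for illustration. The dashed diagonal is an added visual reference and not part of the steps described in the lesson.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_comparison(merged, x_col, y_col, x_label, y_label):
    """Scatter word proportions of two texts against each other and label each point.

    Function name and column names are illustrative assumptions.
    """
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(merged[x_col], merged[y_col], alpha=0.5)
    # Label every point with its word.
    for word, x, y in zip(merged["word"], merged[x_col], merged[y_col]):
        ax.annotate(word, (x, y), fontsize=8)
    # Reference diagonal (an addition): words on it have the same proportion in both texts.
    limit = max(merged[x_col].max(), merged[y_col].max())
    ax.plot([0, limit], [0, limit], linestyle="--", color="grey")
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    return fig


# Toy merged table with invented proportions.
toy_merged = pd.DataFrame({
    "word": ["colonies", "rights", "god"],
    "proportion_thacher": [0.75, 0.25, 0.0],
    "proportion_otis": [0.0, 0.5, 0.5],
})
scatter_fig = plot_comparison(
    toy_merged, "proportion_thacher", "proportion_otis", "Thacher", "Otis"
)
```

With real data, annotating every word quickly clutters the plot, so it can help to pass only the most frequent words to the function.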
This text data visualization suggests that Thacher and Otis are more similar to each other than Thacher and Mayhew are. In the scatter plots, the points lie closer to the diagonal line in the plot relating Thacher and Otis than in the one relating Thacher and Mayhew. This is a simple way to compare the similarity of two texts. We can see, for example, that while Thacher talks a lot about “colonies”, Mayhew talks a lot about “god”.
Conclusions
- You can tokenize text data with the NLTK function word_tokenize;
- With list comprehensions, you can clean text to eliminate irrelevant characters and words;
- Matplotlib is an excellent option for text data visualization.