Computational Text Analysis

Module 03 GESIS Fall Seminar “Introduction to Computational Social Science”

Johannes B. Gruber

GESIS

John McLevey

University of Waterloo

Schedule: GESIS Fall Seminar in Computational Social Science

Course Schedule
Day    Session
Day 1 Introduction to Computational Social Science
Day 2 Obtaining Data
Day 3 Computational Network Analysis
Day 4 Computational Text Analysis
Day 5 Large Language Models in the Social Sciences

Plan for today

  • What is Computational Text Analysis?
  • Dictionary-based Approaches
  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Word Embeddings
  • Generative Large Language Models

What is Computational Text Analysis?

Basics

  • Definition: Computational Text Analysis involves using computer algorithms and techniques to process, analyse, and derive insights from textual data
  • Data Sources: Textual data can come from various sources such as social media posts, news articles, academic papers, emails, forums, and other digital communications

Text/Content Analysis in Social Science

  • Researchers have long recognised that much of social interaction is expressed through words
  • Traditionally these texts are analysed using content analysis in one of several forms (e.g., Quantitative Content Analysis, Discourse Analysis or Hermeneutic Content Analysis)
  • The goal was to understand what actors are saying and writing and sometimes why they might say or write something
  • The steps could involve theory/category building, classification of text and finding relations between texts/actors
  • But scholars struggle with volume: there are often simply too many relevant texts
  • Costs increase at least linearly with larger text corpora (structured collections of text)

Promises of Automatic Content Analysis

  • Systematic analysis of large-scale text collections without massive funding support
  • Depending on the method, results are available almost immediately
  • Perfect reliability, in the sense that given the same data and method, the computer will generate the same results

Pitfalls of Automatic Content Analysis: Four Principles (Grimmer and Stewart 2013)

  1. All quantitative models of language are wrong—but some are useful
  2. Quantitative methods augment humans, not replace them
  3. There is no globally best method for automated text analysis
  4. Validate, Validate, Validate!
  • Problem with 4: The discipline is still young and validation methods are often not canonical or do not exist yet.

text-as-data methods

From Grimmer and Stewart (2013)

text-as-data methods

From Boumans and Trilling (2015)

text-as-data methods

From myself.

Dictionary-based Approaches

What are Dictionary-based Approaches?

  • Lexicon/Dictionary: the words in a language and their meaning
  • Lexicon/Dictionary-based approaches: simply count how often pre-defined words appear to infer the meaning of a text
  • Word counts are usually used to categorise text (e.g., non-/relevant, positive/negative, a-/political)
  • To infer a category from the counts, researchers define a mapping function (e.g., N positive terms > N negative terms = positive text)
  • Like ‘normal’ dictionaries: several forms of a word carry the same meaning, expressed through wildcards (e.g., econom*) or regular expressions (e.g., econom.+), which match economists, economic, and so on (see the small example below)
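
To make this concrete, here is a minimal sketch (the words and pattern are chosen for illustration and are not from the original slides) of how such a pattern matches different word forms in R:

library(stringr)

words <- c("economists", "economic", "economy", "ecology")
str_detect(words, "econom.+")
# TRUE TRUE TRUE FALSE: everything starting with "econom" matches, "ecology" does not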

Why choose Lexicon/Dictionary-based Approaches?

  • Fully transparent even without technical knowledge
  • Lightweight to run, even on enormous data sets
  • Easy to implement for nonconsumptive research (e.g., Google Books lets you search books, but not read/consume them)
  • Valid choice under 3 conditions (Atteveldt, Trilling, and Arcíla 2021):
    1. The variable we want to code is manifest and concrete rather than latent and abstract: names of actors, specific physical objects, specific phrases, etc., rather than feelings, frames, or topics.
    2. All synonyms to be included must be known beforehand.
    3. The dictionary entries must not have multiple meanings.

Setup Environment

  • Switch to Quarto View
library(ngramr)
library(tidyverse)
library(tidytext)
library(yardstick)

Example: Non-Consumptive Research with Google Books

Taken from Duneier (2017): Ghetto: The Invention of a Place, the History of an Idea

  • RQ: How did the meaning of ghetto change over time?
  • Method: Non-Consumptive Research with the Google Books Ngram Viewer
ng  <- ngram(
  phrases = c("ghetto", 
              "(Warsaw ghetto) + (Jewish ghetto)", 
              "(black ghetto) + (negro ghetto)"), 
  year_start = 1920,
  year_end = 1975,
  smoothing = 0,
  count = TRUE
) |> 
  group_by(Year) |> 
  mutate(pct = Frequency / Frequency[Phrase == "ghetto"]) |> 
  filter(Phrase != "ghetto")
# Ngram data table
# Phrases:      
# Case-sensitive:   
# Corpuses:     
# Smoothing:        
# Years:        1920-1975

  Year Phrase    Frequency Corpus Parent  type
1 1920 ghetto 5.954661e-08     en        NGRAM
2 1921 ghetto 4.997540e-08     en        NGRAM
3 1922 ghetto 6.947938e-08     en        NGRAM
4 1923 ghetto 7.295493e-08     en        NGRAM
5 1924 ghetto 7.934730e-08     en        NGRAM
6 1925 ghetto 2.008454e-07     en        NGRAM
ggplot(ng, aes(x = Year, y = pct, colour = Phrase)) +
  geom_line() +
  theme(legend.position = "bottom")

Example 2: Sentiment Analysis

This part is taken from Atteveldt, Trilling, and Arcíla (2021) Chapter 11.2. We first get the data for this, which consists of movie reviews from the IMDB database (Maas et al. 2011).

# You can ignore this part where I download and process the data. The code is
# not really applicable to other datasets, but I left it in here in case you
# find it interesting.
data_file <- "data/imdb.rds"
if (!file.exists(data_file)) {
  message("Downloading data")
  temp <- file.path(tempdir(), "imdb") 
  dir.create(temp, recursive = TRUE)
  curl::curl_download("https://cssbook.net/d/aclImdb_v1.tar.gz",
                      file.path(temp, "imdb.tar.gz"), quiet = FALSE)
  untar(file.path(temp, "imdb.tar.gz"), exdir = temp)
  files <- list.files(temp, 
                      pattern = ".txt", 
                      recursive = TRUE,
                      full.names = TRUE)
  imdb <- map(files, function(f) {
    tibble(
      file = f,
      text = readLines(f, warn = FALSE)
    )
  }) |> 
    bind_rows() |> 
    mutate(label = str_extract(file, "/pos/|/neg/"),
           label = str_remove_all(label, "/"),
           label = factor(label)) |>
    filter(!is.na(label)) |>
    select(-file) |> 
    # adding unique IDs for later
    mutate(id = row_number())
  saveRDS(imdb, data_file)
} else {
  message("Using cached data")
  imdb <- readRDS(data_file)
}

Getting a dictionary

We download a dictionary from the Computational Analysis of Communication website (Atteveldt, Trilling, and Arcíla 2021), which consists of one list of positive and one list of negative words.

poswords <- "https://cssbook.net/d/positive.txt"
negwords <- "https://cssbook.net/d/negative.txt"
sentiment_dict <- bind_rows(
  tibble(word = scan(poswords, what = "character"), value = 1),
  tibble(word = scan(negwords, what = "character"), value = -1)
)
sentiment_dict[c(1:5, 5660:5664), ]
# A tibble: 10 × 2
   word      value
   <chr>     <dbl>
 1 a+            1
 2 abound        1
 3 abounds       1
 4 abundance     1
 5 abundant      1
 6 zaps         -1
 7 zealot       -1
 8 zealous      -1
 9 zealously    -1
10 zombie       -1

Applying dictionary

We then go through all reviews and construct a sentiment score by looking up each word and adding up its score:

scores_df <- imdb |> 
  # For speed, we only take the first 100 reviews
  head(100) |> 
  # This splits up the texts into its individual words
  unnest_tokens(output = "word", input = "text", drop = FALSE) |> 
  # We attach the sentiment_dict to the text data.frame. inner_join drops
  # rows where the word is not in both data.frames
  inner_join(sentiment_dict, by = "word") |>
  # For each text, we calculate the sum of values
  group_by(id) |> 
  summarise(senti_score = sum(value),
            text = head(text, 1))

head(scores_df)
# A tibble: 6 × 3
     id senti_score text                                                        
  <int>       <dbl> <chr>                                                       
1     1           3 "Once again Mr. Costner has dragged out a movie for far lon…
2     2           1 "This is a pale imitation of 'Officer and a Gentleman.' The…
3     3          16 "Years ago, when DARLING LILI played on TV, it was always t…
4     4           2 "I was looking forward to this movie. Trustworthy actors, i…
5     5           9 "First of all, I would like to say that I am a fan of all o…
6     6           5 "This is an example of why the majority of action films are…

Applying dictionary

More commonly, people normalize the absolute count of positive and negative words to construct a score:

  • The resulting score ranges from -1 (most negative sentiment) to +1 (most positive sentiment).
  • mapping function
    • N positive terms >= N negative terms = positive text
    • N positive terms < N negative terms = negative text
scores_df <- imdb |> 
  # For speed, we only take the first 100 reviews
  head(100) |>  
  unnest_tokens(output = "token", input = "text", drop = FALSE) |> 
  inner_join(sentiment_dict, by = c("token" = "word")) |> 
  group_by(id) |>
  # here, we normalise the outcome and assign a sentiment category
  summarise(senti_score = sum(value) / n(),
            sentiment = ifelse(senti_score >= 0, "pos", "neg"),
            text = head(text, 1))

head(scores_df)
# A tibble: 6 × 4
     id senti_score sentiment text                                              
  <int>       <dbl> <chr>     <chr>                                             
1     1      0.333  pos       "Once again Mr. Costner has dragged out a movie f…
2     2      0.0667 pos       "This is a pale imitation of 'Officer and a Gentl…
3     3      0.222  pos       "Years ago, when DARLING LILI played on TV, it wa…
4     4      0.125  pos       "I was looking forward to this movie. Trustworthy…
5     5      0.429  pos       "First of all, I would like to say that I am a fa…
6     6      0.263  pos       "This is an example of why the majority of action…

Plotting the scores

We can plot these results to get an impression of how often each category was predicted and how strong the senti_score was in these cases.

scores_df |> 
  mutate(id = fct_reorder(as.character(id), senti_score)) |> 
  ggplot(aes(x = senti_score, y = id, fill = sentiment)) +
  geom_col() +
  labs(y = NULL, fill = NULL)

Validation

We can validate this approach by comparing the measured sentiment to the real sentiment, as given in the dataset:

validation_df <- scores_df |> 
  select(-text) |> 
  left_join(imdb, by = "id")

# have a look at the new data.frame
validation_df |> 
  select(id, label, sentiment, senti_score)
# A tibble: 100 × 4
      id label sentiment senti_score
   <int> <fct> <chr>           <dbl>
 1     1 neg   pos            0.333 
 2     2 neg   pos            0.0667
 3     3 neg   pos            0.222 
 4     4 neg   pos            0.125 
 5     5 neg   pos            0.429 
 6     6 neg   pos            0.263 
 7     7 neg   neg           -0.167 
 8     8 neg   pos            0.217 
 9     9 neg   pos            0.133 
10    10 neg   neg           -0.4   
# ℹ 90 more rows

An easy way to validate performance is to calculate how often the prediction and the real sentiment match:

validation_df |> 
  count(match = label == sentiment)
# A tibble: 2 × 2
  match     n
  <lgl> <int>
1 FALSE    54
2 TRUE     46

However, the absolute count of matches, or accuracy, is prone to errors, since it does not take chance into account. For example, by taking only the first 100 reviews, we happened to gather data that contains only negative cases:

validation_df |> 
  count(label)
# A tibble: 1 × 2
  label     n
  <fct> <int>
1 neg     100

So while optimising our mapping function we could accidentally make a wrong adjustment:

scores_df_error <- imdb |> 
  head(100) |>  
  unnest_tokens(output = "token", input = "text", drop = FALSE) |> 
  inner_join(sentiment_dict, by = c("token" = "word")) |> 
  group_by(id) |>
  summarise(senti_score = sum(value) / n(),
            sentiment = ifelse(senti_score >= 0, "neg", "neg"),
            #                 see the error here --^
            text = head(text, 1))

scores_df_error |> 
  select(-text) |> 
  left_join(imdb, by = "id") |> 
  count(match = label == sentiment)
# A tibble: 1 × 2
  match     n
  <lgl> <int>
1 TRUE    100

Now suddenly our accuracy is perfect! This is why we use a couple of metrics that control for chance:

conf_matrix <- table(validation_df$label, validation_df$sentiment)
ml_metrics <- metric_set(accuracy, precision, recall, f_meas, kap)
conf_matrix
     
      neg pos
  neg  46  54
  pos   0   0
ml_metrics(conf_matrix)
# A tibble: 5 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.46 
2 precision binary         0.46 
3 recall    binary         1    
4 f_meas    binary         0.630
5 kap       binary         0    

Quick explanation of the common metrics

  1. Precision: This is the ratio of correctly predicted positive observations to the total predicted positives. High precision means that the algorithm returned more relevant results than irrelevant ones.

  2. Recall (Sensitivity): This is the ratio of correctly predicted positive observations to all observations that actually belong to the class. High recall means that the algorithm returned most of the relevant results.

  3. F1 Score: This is the harmonic mean of Precision and Recall. It tries to find the balance between the two. A high F1 score means that both recall and precision are high.

  4. Cohen’s Kappa (kap): This is the agreement between predictions and true labels corrected for the agreement expected by chance; 0 means the classifier does no better than chance.

We often use a confusion matrix to calculate these metrics. A confusion matrix is a 2x2 table that contains the 4 outputs provided by a binary classifier. The terminology can vary, but it’s often formatted as follows:

                     Actual Positive       Actual Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
Predicted Negative   False Negative (FN)   True Negative (TN)

Based on this matrix:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

A quick example

Let’s say we have a binary classifier that’s being used to predict whether a given email is “spam” (positive class) or “not spam” (negative class). Suppose our classifier is tested on a dataset of 100 emails, which include 20 actual spam messages and 80 actual non-spam messages. The classifier outputs the following results:

                     Actual Spam   Actual Not Spam
Predicted Spam       15 (TP)       5 (FP)
Predicted Not Spam   5 (FN)        75 (TN)

Here, the precision, recall, and F1 score would be calculated as:

  • Precision = TP / (TP + FP) = 15 / (15 + 5) = 0.75
  • Recall = TP / (TP + FN) = 15 / (15 + 5) = 0.75
  • F1 Score = 2 × (Recall × Precision) / (Recall + Precision) = 2 × (0.75 × 0.75) / (0.75 + 0.75) = 0.75

So in this case, the classifier has a precision of 0.75 (meaning that 75% of the emails it labeled as “spam” were actually spam), a recall of 0.75 (meaning that it correctly identified 75% of the total spam emails), and an F1 score of 0.75 (balancing precision and recall). Which of these metrics is most useful depends on the task. If you detect cancer in patients, false positives are not great, but false negatives might be deadly! In that case you should optimise recall and only look at the other metrics as an afterthought. Most of the time, however, we want a good F1 value (a rule of thumb is above 0.7, or better, above 0.8). Note that in most cases we do not really care about one particular “positive” class, as both classes are equally important. And often we have more than 2 classes. A good strategy then is to calculate F1 for each class and to report all values plus their average.
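
To connect the arithmetic back to R, here is a minimal sketch (not from the original slides) that recomputes the three metrics from the four cells of the hypothetical spam confusion matrix above:

# cells of the hypothetical spam confusion matrix
TP <- 15; FP <- 5; FN <- 5; TN <- 75

prec <- TP / (TP + FP)                   # 0.75
rec  <- TP / (TP + FN)                   # 0.75
f1   <- 2 * (prec * rec) / (prec + rec)  # 0.75

c(precision = prec, recall = rec, f1 = f1)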

Issues

  • The more terms we add to our dictionary, the more false positives we will get
  • Building a good dictionary is a lot of work (complexity-resource plot):
    • Negation and bag-of-words issues: “not good” will be counted as positive, and modifiers such as “very good” are ignored (see the small illustration after this list)
    • “great” should be more positive than “good”
  • Negative image of dictionaries in academia
    • Many negative examples where dictionaries were applied outside of the domain they had been developed for
    • Wrong belief that popular off-the-shelf dictionaries do not need validation
    • Many papers show that dictionaries do not perform as well as machine learning: e.g. Van Atteveldt, Van Der Velden, and Boukes (2021); González-Bailón and Paltoglou (2015); Boukes et al. (2020)
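
To illustrate the negation problem (assuming, as is very likely, that “good” is the only word of this sentence that appears in the dictionary), a plain bag-of-words count simply drops “not” and scores the review as positive:

library(dplyr)
library(tidytext)

tibble(id = 1, text = "The movie was not good at all") |>
  unnest_tokens(output = "word", input = "text") |>
  inner_join(sentiment_dict, by = "word") |>  # sentiment_dict from above
  summarise(senti_score = sum(value))
# "not" has no entry in the dictionary, so only "good" (+1) counts: a positive text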

Now that you know about dictionaries, remember to apply them only under certain circumstances:

  1. When no other method is available, e.g., in data retrieval or nonconsumptive research
  2. The variable we want to code is manifest and concrete rather than latent and abstract: names of actors, specific physical objects, specific phrases, etc., rather than feelings, frames, or topics.
  3. All synonyms to be included must be known beforehand.
  4. The dictionary entries must not have multiple meanings.

Supervised Machine Learning

What are Supervised Machine Learning Approaches?

Steps in SML

We proceed in 4 (or 5) steps:

  1. preprocessing the incoming text
  2. splitting the dataset into a training and a test set (the latter is not used to fit the model, only for validation)
  3. fitting (or training) the model
  4. using the test set to compare predictions against the real values for validation
  5. (using the model on new documents)

Why choose Supervised Machine Learning Approaches?

  • Usually more accurate than dictionaries
  • Straightforward way to construct and validate models
  • Even when models are not valid (enough), you still have hand-coded data that you can fall back on

What should you use?

Several frameworks in R:

  • RTextTools: includes the most important algorithms, but outdated
  • caret: includes a huge variety of models and comes with robust helper functions, but superseded
  • quanteda: nice suite of text analysis functions, including quanteda.textmodels
  • tidymodels: successor of caret with an extensive collection of algorithms, preprocessing, validation and comparison functions

1. Using textrecipes to turn text into features

  • textrecipes (an extension of the recipes package): handles all preprocessing in a unified syntax
  • the syntax is similar across a large variety of machine learning models (including, e.g., OLS or logistic regression)

Define recipe:

library(tidymodels)
library(textrecipes)
imdb_rec <- recipe(label ~ text, data = imdb) |>
  # turn text into features/tokens/words
  step_tokenize(all_predictors()) |>
  # remove stopwords
  step_stopwords(all_predictors(), language = "en") |> 
  step_tokenfilter(all_predictors(), min_times = 3) |>
  step_tf(all_predictors())

Bake recipe with ingredients (data):

imdb_rec |> 
  prep(head(imdb, 10)) |>
  bake(new_data = NULL)
# A tibble: 10 × 101
   label tf_text_act tf_text_acting tf_text_action tf_text_actors tf_text_almost
   <fct>       <int>          <int>          <int>          <int>          <int>
 1 neg             0              0              0              0              0
 2 neg             0              0              1              0              0
 3 neg             1              0              2              0              1
 4 neg             0              0              0              1              1
 5 neg             0              0              0              1              1
 6 neg             0              3              2              0              1
 7 neg             2              1              0              2              0
 8 neg             0              2              1              0              0
 9 neg             0              0              1              0              0
10 neg             0              0              0              0              0
# ℹ 95 more variables: tf_text_also <int>, tf_text_always <int>,
#   tf_text_andrews <int>, tf_text_around <int>, tf_text_back <int>,
#   tf_text_badly <int>, tf_text_best <int>, tf_text_better <int>,
#   tf_text_bill <int>, tf_text_blake <int>, tf_text_box <int>,
#   tf_text_br <int>, tf_text_brass <int>, tf_text_can <int>,
#   tf_text_care <int>, tf_text_character <int>, tf_text_characters <int>,
#   tf_text_comes <int>, tf_text_costner <int>, tf_text_created <int>, …

Note

Stopwords are words that are thought to add no meaning to a text and can be safely removed. Many text analysis packages come with their own selection, but whether it’s useful depends on context! You can look at stopwords::stopwords() to see one selection.
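
For example, a quick peek at the default English list from the stopwords package:

head(stopwords::stopwords(language = "en"), 10)
# e.g. "i", "me", "my", "myself", "we", "our", ...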

2. Splitting the dataset

  • important to leave out a portion of the data for validation:
set.seed(1)
split <- initial_split(
  data = imdb, 
  prop = 3 / 4,   # the prop is the default, I just wanted to make that visible
  strata = label  # this makes sure the prevalence of labels is still the same afterwards
) 
imdb_train <- training(split)
imdb_test <- testing(split)

Important

Making the split only once is not good practice, as which texts end up in the training and the test portion of the data can influence the performance metrics of the model. Some researchers have even been caught trying many different seeds to find one that delivers better-looking results. What you should do instead is use bootstraps() to fit the model on multiple different resamples. Check out one of Julia Silge’s videos if you want to learn more.
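
A hedged sketch of what that could look like (assuming the imdb_wf_nb workflow defined in step 3 below; this is not run in these slides):

set.seed(1)
folds <- bootstraps(imdb_train, times = 10, strata = label)

resampled <- fit_resamples(
  imdb_wf_nb,                          # the workflow from step 3 below
  resamples = folds,
  metrics = metric_set(accuracy, f_meas)
)
collect_metrics(resampled)             # metrics averaged over all resamples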

3. Fitting a model

Let’s start with a naive Bayes model, for which we need another package that contains this “engine” (you don’t need to know this in advance; the next step would tell you to load the package):

library(discrim)
nb_spec <- naive_Bayes() |>
  set_mode("classification") |>
  set_engine("naivebayes")

Now we bring the recipe and model together in a new workflow:

imdb_wf_nb <- workflow() |> 
  add_recipe(imdb_rec) |> 
  add_model(nb_spec)

Now, we fit the model:

model_nb <- fit(imdb_wf_nb, data = imdb_train)

4. Evaluating the model

imdb_prediction <- imdb_test |> 
  bind_cols(predict(model_nb, new_data = imdb_test)) |>
  rename(truth = label, estimate = .pred_class)

conf_mat(imdb_prediction, truth, estimate)
          Truth
Prediction  neg  pos
       neg 5861 4142
       pos  389 2108
library(gt)
my_metrics <- metric_set(accuracy, kap, precision, recall, f_meas)

my_metrics(imdb_prediction, truth = truth, estimate = estimate) |> 
  # I use gt to make the table look a bit nicer, but it's optional
  gt() |> 
  data_color(
    columns = .estimate,
    fn = scales::col_numeric(
      palette = c("red", "orange", "green"),
      domain = c(0, 1)
    )
  )
.metric .estimator .estimate
accuracy binary 0.6375200
kap binary 0.2750400
precision binary 0.5859242
recall binary 0.9377600
f_meas binary 0.7212207

Alternative models

Let’s walk through the steps again to train a logistic regression model:

# choosing the model
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) |>
  set_mode("classification") |>
  set_engine("glmnet")

# bring preprocessing and model together in a workflow
imdb_wf_lasso <- workflow() |> 
  add_recipe(imdb_rec) |> 
  add_model(lasso_spec)

# train the model
model_lasso <- fit(imdb_wf_lasso, data = imdb_train)

# evaluate
imdb_prediction_lasso <- imdb_test |> 
  bind_cols(predict(model_lasso, new_data = imdb_test)) |> 
  rename(truth = label, estimate = .pred_class)

my_metrics <- metric_set(accuracy, kap, precision, recall, f_meas)

my_metrics(imdb_prediction_lasso, truth = truth, estimate = estimate) |> 
  gt() |> 
  data_color(
    columns = .estimate,
    fn = scales::col_numeric(
      palette = c("red", "orange", "green"),
      domain = c(0, 1)
    )
  )
.metric .estimator .estimate
accuracy binary 0.7252800
kap binary 0.4505600
precision binary 0.7390492
recall binary 0.6964800
f_meas binary 0.7171334

Additional step: looking inside the model

A way to make sense of a model is to look at its coefficients. This essentially tells us which terms the model considers important for classifying a document as positive or negative.

model_lasso |> 
  extract_fit_engine() |> 
  vip::vi() |>
  group_by(Sign) |>
  slice_max(Importance, n = 20) |> 
  ungroup() |>
  mutate(
    Variable = str_remove(Variable, "tf_text_"),
    Variable = fct_reorder(Variable, Importance)
  ) |>
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(y = NULL)

Issues

  • Many training documents needed to build a valid model
  • Even then, there is no guarantee you get good evaluation results
  • Many choices in terms of models and preprocessing, but no one model fits all
  • Models are specific to the training data and often will not work in other domains

Learn more

  • Emil Hvitfeldt, Julia Silge: Supervised Machine Learning for Text Analysis in R, https://smltar.com/

Unsupervised Machine Learning

What are Unsupervised Machine Learning Approaches?

  • The goal of topic modelling is to automatically extract topics from documents without requiring human supervision
  • Explore text collection by letting the computer figure out which words and which texts belong together
  • Commonly, Latent Dirichlet Allocation (LDA) is used for probabilistic topic modelling
  • Newer alternatives like BERTopic are often considered to improve on it
  • Although the idea of an algorithm figuring out topics might sound close to magical (mostly because people have too high expectations of what these ‘topics’ are), and the mathematics might be a bit challenging, it is actually really simple to fit an LDA topic model in R

Playful exploration [1]

A good first step towards understanding what topic models are and how they can be useful is to simply play around with them, so that’s what we’ll do first. Open the page https://lettier.com/projects/lda-topic-modeling/ (if possible, in Firefox).

Example Docs:

  1. 🐭 🐭 🐭 🐭 🐭 🐭 🐭 🐭 🐭 🐭
  2. 🐱 🐱 🐱 🐱 🐱 🐱 🐱 🐱 🐱 🐱
  3. 🐶 🐶 🐶 🐶 🐶 🐶 🐶 🐶 🐶 🐶
  4. 🐭 🐭 🐭 🐭 🐭 🐭 🐭 🐭 🐭 🐭 🐱 🐱 🐱 🐱 🐱 🐱 🐱 🐱 🐱 🐱 🐶 🐶 🐶 🐶 🐶 🐶 🐶 🐶 🐶 🐶

On top, the texts are turned into a document feature matrix.

Playful exploration

What LDA does (essentially):

  • reduce the dimensions of a text collection into a set of ‘topics’ that are easier to interpret
  • express the relation between cases (documents) and topics, as well as between variables (words) and topics, in two matrices (see the small sketch after this list)
  • If you have used PCA, MDS or factor analysis before, this is essentially the same process
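
In R terms, and assuming a fitted seededlda model such as the lda_model object from the Bundestag example later in these slides, the two matrices are directly available:

dim(lda_model$phi)    # topics x features: how strongly each word loads on each topic
dim(lda_model$theta)  # documents x topics: how much of each document belongs to each topic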

Playful exploration

The first matrix describes the probability that a feature belongs to a topic. We call this the feature-topic matrix:

This case makes it pretty clear:

  • mice belong to the first topic
  • cats belong to the second topic
  • dogs belong to the third topic

Playful exploration

The second matrix does the same, but for documents. We call this the document-topic matrix:

  • The first document has a high probability to belong to the first topic, because it is full of mice
  • The second document has a high probability to belong to the second topic, because it is full of cats
  • The third document has a high probability to belong to the third topic, because it is full of dogs

Playful exploration

What often makes LDA seem magical is how well it works for text. This is because the underlying assumptions fit the statistical features of document collections quite well, leading to meaningful and interpretable categories that make it easy to explore and summarise what is happening in text.

Playful exploration

The short example above shows the three broad steps of topic modelling:

  1. Create a document-feature-matrix from your documents (and preprocess it)
  2. Fit the topic model
  3. Analyze (and validate) the results

Before we go on, here are a few things I want you to try:

  1. Change the Alpha and Beta values
  2. Add and replace your own text
  3. Change the number of topics

(0) Obtaining the data

We use a subset of the Parlspeech corpus (Rauh and Schwalbach 2020), spanning the 18th legislative period of the Bundestag.

# Again, you can ignore this part where I download and process the data. The code is
# not really applicable to other datasets, but I left it in here in case you
# find it interesting.
data_file <- "data/bundestag18_speeches.rds"
if (!file.exists(data_file)) {
  library(dataverse)
  Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
  ds <- get_dataset("doi:10.7910/DVN/L4OAKN")
  as_tibble(ds[["files"]][, c("label", "id")])
  
  url <- get_file(3758785, 
                  dataset = "doi:10.7910/DVN/ZY3RV7",
                  return_url = TRUE)
  
  curl::curl_download(url, destfile = "data/Corp_Bundestag_V2.rds", quiet = FALSE)
  
  bundestag18 <- readRDS("data/Corp_Bundestag_V2.rds") |>
    mutate(date = ymd(date),
           speechnumber = as.integer(speechnumber)) |>
    filter(date >= "2013-10-22",
           date <= "2017-10-24") |>
  mutate(doc_id = paste0(date, "-", speechnumber)) |>
  as_tibble()
  saveRDS(bundestag18, data_file)
} else {
  bundestag18 <- readRDS(data_file)
}

(1) Creating a DFM

We first tidy the documents:

bundestag18_tidy <- bundestag18 |>  
  unnest_tokens(output = word, input = "text")

Secondly, we do some light cleaning:

  • removing stopwords
  • removing rare terms
  • removing features that aren’t words
bundestag18_tidy_clean <- bundestag18_tidy |>
  filter(!word %in% c(stopwords::stopwords(language = "de"), 
                      "dass", "kollege", "kollegin", "herr", "frau", "dr")) |>
  group_by(word) |>
  filter(
    n() > 10,                    # keep features that appear more than 10 times
    !str_detect(word, "[^a-z]")  # keep features that consist only of lowercase letters
  ) |>
  ungroup()

print(glue::glue(
  "Cleaning removed {length(unique(bundestag18_tidy$word)) - length(unique(bundestag18_tidy_clean$word))} ",
  "unique words and ",
  "{length(unique(bundestag18_tidy$doc_id)) - length(unique(bundestag18_tidy_clean$doc_id))} documents. ",
  "{length(unique(bundestag18_tidy_clean$word))} unique words remain in ",
  "{length(unique(bundestag18_tidy_clean$doc_id))} documents"
))
Cleaning removed 197067 unique words and 986 documents. 27121 unique words remain in 51798 documents

Now we can create a document-feature-matrix. This is a (sparse) matrix showing how often each term (column) occurs in each document (row):

bundestag18_dfm <- bundestag18_tidy_clean %>%
  count(doc_id, word) |>
  cast_dfm(doc_id, word, n)

We can inspect a corner of the dfm by casting it to a regular (dense) matrix:

as.matrix(bundestag18_dfm[1:5, 1:5])
               features
docs            abgeordnete abgeordneten abgeordneter abs abschluss
  2013-10-22-1            2            2            2   1         1
  2013-10-22-10           0            1            0   0         0
  2013-10-22-11           0            0            0   0         0
  2013-10-22-12           0            0            0   0         0
  2013-10-22-13           0            0            0   0         0

(2) Running the topic model

  • many different ways to fit a topic model (the topicmodels, lda or stm packages; stay away from mallet; try BERTopic if you know a little Python)
  • I use the textmodel_lda() function from the seededlda package because of its clean syntax
library(seededlda)
k <- 10
set.seed(2024) # Note that we use `set.seed` to create reproducible results
lda_model <- textmodel_lda(
  bundestag18_dfm,
  k = k,              # the number of topics is chosen arbitrarily here, for demonstration purposes
  max_iter = 200L,    # I would not usually recommend so few iterations; it just makes this run quicker
  alpha = 50L / k,    # these are the default values in the package
  beta = 0.1,
  verbose = TRUE
)
Fitting LDA with 10 topics
 ...initializing
 ...Gibbs sampling in 200 iterations
 ......iteration 100 elapsed time: 37.83 seconds (delta: 0.09%)
 ......iteration 200 elapsed time: 72.02 seconds (delta: -0.00%)
 ...computing theta and phi
 ...complete

(3) Inspecting and analysing the results

Word-topic probabilities

bundestag18_ftm <- lda_model$phi |>
  # converting the matrix into a data.frame makes sure it plays nicely with the tidyverse
  as.data.frame() |>
  # the topic names/numbers are stored in the row.names, I move them to a column
  rownames_to_column("topic") |>
  # turn into a factor to maintain the correct order
  mutate(topic = fct_inorder(topic)) |>
  # turn into long format for plotting
  pivot_longer(-topic, names_to = "word", values_to = "phi")

topic_topwords_plot <- bundestag18_ftm |>
  # using group_by and slice_max, we keep only the top 15 values from each topic
  group_by(topic) |>
  slice_max(order_by = phi, n = 15) |>
  # using reorder_within does some magic for a nicer plot
  mutate(word = tidytext::reorder_within(word, by = phi, within = topic)) |>
  # from here on, we just make a bar plot with facets
  ggplot(aes(x = phi, y = word, fill = topic)) +
  geom_col() +
  tidytext::scale_y_reordered() +
  facet_wrap(~topic, ncol = 2, scales = "free_y")
topic_topwords_plot

Going forward, I would now name these topics. I have found this particular format, exported to an Excel sheet, helpful:

lda_model$phi |>
  as.data.frame() |>
  rowid_to_column("topic") |>
  pivot_longer(-topic, names_to = "word", values_to = "phi") |>
  group_by(topic) |>
  slice_max(order_by = phi, n = 20) |>
  mutate(top = row_number()) |>
  pivot_wider(id_cols = top, names_from = topic, values_from = word) |>
  # Add an extra row where you can write in topic names
  add_row(top = NA, .before = 1) %>%
  rio::export("topicsmodel_topwords.xlsx")

Topics per document

Similarly to above, we can also extract the topics per document:

bundestag18_dtm <- lda_model$theta |>
  as.data.frame() |>
  rownames_to_column("doc_id") |>
  as_tibble()
bundestag18_dtm
# A tibble: 51,798 × 11
   doc_id topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10
   <chr>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
 1 2013-… 0.0936 0.171  0.355  0.0195 0.0890 0.0667 0.0371 0.0306 0.103   0.0343
 2 2013-… 0.260  0.0632 0.123  0.0421 0.0632 0.0246 0.0772 0.214  0.0421  0.0912
 3 2013-… 0.0980 0.0980 0.0980 0.0980 0.0980 0.118  0.0980 0.0980 0.0980  0.0980
 4 2013-… 0.112  0.0875 0.112  0.075  0.075  0.0625 0.138  0.15   0.1     0.0875
 5 2013-… 0.0909 0.0909 0.109  0.0909 0.0909 0.127  0.0909 0.0909 0.0909  0.127 
 6 2013-… 0.255  0.0691 0.112  0.0372 0.0638 0.0426 0.122  0.0904 0.0585  0.149 
 7 2013-… 0.0943 0.0943 0.0943 0.0943 0.0943 0.113  0.0943 0.0943 0.0943  0.132 
 8 2013-… 0.257  0.0616 0.0978 0.0290 0.0435 0.0399 0.0833 0.207  0.0652  0.116 
 9 2013-… 0.0865 0.0602 0.0564 0.0263 0.0338 0.0677 0.0526 0.113  0.0376  0.466 
10 2013-… 0.0841 0.0561 0.0748 0.0654 0.0841 0.0841 0.0841 0.112  0.0841  0.271 
# ℹ 51,788 more rows

We can tidy this and join the results back with the original metadata using the doc_id:

bundestag18_dtm_tidy <- bundestag18_dtm |>
  pivot_longer(-doc_id, names_to = "topic", values_to = "theta") |>
  # again, this is to keep track of the order, which is otherwise alphabetical
  mutate(topic = fct_inorder(topic)) |>
  left_join(bundestag18 |> select(-text), by = "doc_id")
bundestag18_dtm_tidy
# A tibble: 517,980 × 13
   doc_id       topic    theta date       agenda      speechnumber speaker party
   <chr>        <fct>    <dbl> <date>     <chr>              <int> <chr>   <chr>
 1 2013-10-22-1 topic1  0.0936 2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 2 2013-10-22-1 topic2  0.171  2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 3 2013-10-22-1 topic3  0.355  2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 4 2013-10-22-1 topic4  0.0195 2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 5 2013-10-22-1 topic5  0.0890 2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 6 2013-10-22-1 topic6  0.0667 2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 7 2013-10-22-1 topic7  0.0371 2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 8 2013-10-22-1 topic8  0.0306 2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
 9 2013-10-22-1 topic9  0.103  2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
10 2013-10-22-1 topic10 0.0343 2013-10-22 SITZUNGSBE…            1 Heinz … <NA> 
# ℹ 517,970 more rows
# ℹ 5 more variables: party.facts.id <dbl>, chair <lgl>, terms <dbl>,
#   parliament <chr>, iso3country <chr>

Now, we can e.g. compare topic usage per party:

bundestag18_dtm_tidy |> 
  filter(!is.na(party),
         party != "independent") |>
  group_by(party, topic) |> 
  summarize(theta = mean(theta)) %>%
  ggplot(aes(x = theta, y = topic, fill = party)) + 
  geom_col(position = "dodge")  +
  scale_fill_manual(values = c(
    "PDS/LINKE" = "#BD3075",
    "SPD" = "#D71F1D",
    "GRUENE" = "#78BC1B",
    "CDU/CSU" = "#121212",
    "FDP" = "#FFCC00",
    "AfD" = "#4176C2"
  ))

Or over time:

bundestag18_dtm_tidy |>
  group_by(date = floor_date(date, "months"), topic) |> 
  summarize(theta = mean(theta)) |>
  ggplot(aes(x = date, y = theta, colour = topic)) +
  geom_line()

bundestag18_dtm_tidy |>
  group_by(date = floor_date(date, "months"), topic) |> 
  summarize(theta = mean(theta)) |>
  ggplot(aes(x = date, y = theta, colour = topic)) +
  geom_line() +
  facet_wrap(vars(topic))

Some alternative ways to explore the model

LDAvis

A popular way of exploring topics and how they overlap is the LDAvis package.

library(LDAvis)
json <- createJSON(phi = lda_model$phi,
                   theta = lda_model$theta, 
                   doc.length = quanteda::ntoken(lda_model$data),
                   vocab = quanteda::featnames(lda_model$data), 
                   term.frequency = quanteda::featfreq(lda_model$data))
serVis(json)

LDAvis helps users understand the relationships between topics and the key terms that define them.

tokenbrowser

We can check how the probabilities for documents are calculated by looking at the words and their topic probabilities in their original context, using the tokenbrowser package developed by Kasper Welbers.

We first assign each word to the topic for which it has the highest phi value and keep only the words for which that topic accounts for at least half of the total phi, i.e., the ones which most clearly belong to one topic.

categories <- bundestag18_ftm %>%
  group_by(word) |>
  mutate(phi_rel = phi / sum(phi)) |>
  slice_max(order_by = phi, n = 1) |>
  ungroup() |>
  filter(phi_rel >= 0.5)

Then we attach the categories to the original tidy representation of the texts.

assignments <- bundestag18_tidy |>
  filter(doc_id %in% unique(bundestag18_tidy$doc_id)[1:5]) |>
  left_join(categories, by = "word")

Now we can look at the words that clearly belong to a topic in context of the full speeches.

library(tokenbrowser)
categorical_browser(
  assignments,
  category = as.factor(assignments$topic), 
  token_col = "word"
) |>
  browseURL()

Finding an optimal number of topics

  • The best way to find the optimal number of topics k is to interpret different models and look for the one that divides your corpus into the most meaningful topics
  • That is very cumbersome though, and there are some statistical methods to make the process easier
  • The idea behind all of them is to compare metrics across different models in order to narrow down your search
library(furrr)
lda_models <- "data/lda_models.rds"
# since this takes a long time, I save the results in case I want to run it again
if (!file.exists(lda_models)) {
  plan(multisession)  ## for parallel processing
  lda_fun <- function(k, max_iter = 200) {
    textmodel_lda(
      bundestag18_dfm,
      k = k,
      max_iter = max_iter,
      alpha = 50 / k,
      beta = 0.1,
      verbose = TRUE
    )
  }
  
  models_df <- tibble(
    k = c(10:20),
    model = future_map(k, lda_fun, .options = furrr_options(seed = 1))
  )
  
  saveRDS(models_df, lda_models)
} else {
  models_df <- readRDS(lda_models)
}

There is no official function in seededlda to evaluate different models (the stm package is much better here, as demonstrated by Julia Silge). But since I did not want to introduce another package, I copied the functions that are currently being discussed in an issue on GitHub.

semantic_coherence <- function(model, top_n = 10) {
    h <- apply(terms(model, top_n), 2, function(y) {
        d <- model$data[,y]
        e <- Matrix::Matrix(docfreq(d), nrow = nfeat(d), ncol = nfeat(d))
        f <- fcm(d, count = "boolean") + 1
        g <- Matrix::band(log(f / e), 1, ncol(f))
        sum(g)
    })
    sum(h)
}

divergence <- function(model) {
    div <- proxyC::dist(model$phi, method = "kullback")
    diag(div) <- NA
    mean(as.matrix(div), na.rm = TRUE)
}

# this one is taken from stm https://github.com/bstewart/stm/blob/master/R/exclusivity.R
exclusivity <- function(model, top_n = 10, frexw = 0.7) {
  
  tphi <- t(exp(model$phi))
  s <- rowSums(tphi)
  mat <- tphi / s # normed by columns of beta now.
  
  ex <- apply(mat, 2, rank) / nrow(mat)
  fr <- apply(tphi, 2, rank) / nrow(mat)
  frex <- 1 / (frexw / ex + (1 - frexw) / fr)
  index <- apply(tphi, 2, order, decreasing = TRUE)[1:top_n, ]
  
  out <- vector(length = ncol(tphi))
  for (i in seq_len(ncol(frex))) {
    out[i] <- sum(frex[index[, i], i])
  }
  
  return(mean(out))
}

We can now compute these metrics for each model and plot them for comparison.

models_df_metrics <- models_df |>
  mutate(semantic_coherence = map_dbl(model, semantic_coherence),
         exclusivity = map_dbl(model, exclusivity),
         divergence = map_dbl(model, divergence))

models_df_metrics |>
  select(-model) |>
  pivot_longer(-k, names_to = "metric") |>
  ggplot(aes(x = k, value, color = metric)) +
  geom_line(linewidth = 1.5, alpha = 0.7, show.legend = FALSE) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  facet_wrap(~metric, scales = "free_y") +
  labs(x = "K (number of topics)",
       y = NULL,
       title = "Model diagnostics by number of topics",
       subtitle = "Higher = Better")

Issues

  • Often misused for text classification - topic models are for corpus exploration!
  • Hardly ever validated (see oolong)
  • Humans have the tendency to read meaning into meaningless patterns

Learn more

  • Wouter van Atteveldt, Damian Trilling & Carlos Arcila: Computational Analysis of Communication, https://cssbook.net/

References

Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. 2021. Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Hoboken, NJ: John Wiley & Sons. https://cssbook.net.
Boukes, Mark, Bob Van De Velde, Theo Araujo, and Rens Vliegenthart. 2020. “What’s the Tone? Easy Doesn’t Do It: Analyzing Performance and Agreement Between Off-the-Shelf Sentiment Analysis Tools.” Communication Methods and Measures 14 (2): 83–104. https://doi.org/10.1080/19312458.2019.1671966.
Boumans, Jelle W., and Damian Trilling. 2015. “Taking Stock of the Toolkit.” Digital Journalism 4 (1): 8–23. https://doi.org/10.1080/21670811.2015.1096598.
Duneier, Mitchell. 2017. Ghetto: The Invention of a Place, the History of an Idea. First paperback edition. New York: Farrar, Straus and Giroux.
González-Bailón, Sandra, and Georgios Paltoglou. 2015. “Signals of Public Opinion in Online Communication: A Comparison of Methods and Data Sources.” SSRN Scholarly Paper. Rochester, NY. https://papers.ssrn.com/abstract=2558788.
Grimmer, J., and B. M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://doi.org/10.1093/pan/mps028.
Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. “Learning Word Vectors for Sentiment Analysis.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–50. Portland, Oregon, USA: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1015.
Rauh, Christian, and Jan Schwalbach. 2020. “The ParlSpeech V2 Data Set: Full-Text Corpora of 6.3 Million Parliamentary Speeches in the Key Legislative Chambers of Nine Representative Democracies.” Harvard Dataverse. https://doi.org/10.7910/DVN/L4OAKN.
Van Atteveldt, Wouter, Mariken A. C. G. Van Der Velden, and Mark Boukes. 2021. “The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms.” Communication Methods and Measures 15 (2): 121–40. https://doi.org/10.1080/19312458.2020.1869198.

Footnotes

  1. heavily influenced by this piece: https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d