Module 03 GESIS Fall Seminar “Introduction to Computational Social Science”
GESIS
University of Waterloo
| Day | Session |
|---|---|
| Day 1 | Introduction to Computational Social Science |
| Day 2 | Obtaining Data |
| Day 3 | Computational Network Analysis |
| Day 4 | Computational Text Analysis |
| Day 5 | Large Language Models in the Social Sciences |
From Grimmer and Stewart (2013)
From Boumans and Trilling (2015)
From myself.
Taken from Duneier (2017): Ghetto: The Invention of a Place, the History of an Idea
# Ngram data table
# Phrases:
# Case-sensitive:
# Corpuses:
# Smoothing:
# Years: 1920-1975
Year Phrase Frequency Corpus Parent type
1 1920 ghetto 5.954661e-08 en NGRAM
2 1921 ghetto 4.997540e-08 en NGRAM
3 1922 ghetto 6.947938e-08 en NGRAM
4 1923 ghetto 7.295493e-08 en NGRAM
5 1924 ghetto 7.934730e-08 en NGRAM
6 1925 ghetto 2.008454e-07 en NGRAM
This part is taken from Atteveldt, Trilling, and Arcíla (2021) Chapter 11.2. We first get the data for this, which consists of movie reviews from the IMDB database (Maas et al. 2011).
# You can ignore this part where I download and process the data. The code is
# not really applicable to other datasets. But I left it in here in case you
# find it interesting.
data_file <- "data/imdb.rds"
if (!file.exists(data_file)) {
message("Downloading data")
temp <- file.path(tempdir(), "imdb")
dir.create(temp, recursive = TRUE)
curl::curl_download("https://cssbook.net/d/aclImdb_v1.tar.gz",
file.path(temp, "imdb.tar.gz"), quiet = FALSE)
untar(file.path(temp, "imdb.tar.gz"), exdir = temp)
files <- list.files(temp,
pattern = ".txt",
recursive = TRUE,
full.names = TRUE)
imdb <- map(files, function(f) {
tibble(
file = f,
text = readLines(f, warn = FALSE)
)
}) |>
bind_rows() |>
mutate(label = str_extract(file, "/pos/|/neg/"),
label = str_remove_all(label, "/"),
label = factor(label)) |>
filter(!is.na(label)) |>
select(-file) |>
# adding unique IDs for later
mutate(id = row_number())
saveRDS(imdb, data_file)
} else {
message("Using cached data")
imdb <- readRDS(data_file)
}

We download a dictionary from the Computational Analysis of Communication website (Atteveldt, Trilling, and Arcíla 2021), which consists of one list of positive and one list of negative words.
poswords <- "https://cssbook.net/d/positive.txt"
negwords <- "https://cssbook.net/d/negative.txt"
sentiment_dict <- bind_rows(
tibble(word = scan(poswords, what = "character"), value = 1),
tibble(word = scan(negwords, what = "character"), value = -1)
)
sentiment_dict[c(1:5, 5660:5664), ]
# A tibble: 10 × 2
word value
<chr> <dbl>
1 a+ 1
2 abound 1
3 abounds 1
4 abundance 1
5 abundant 1
6 zaps -1
7 zealot -1
8 zealous -1
9 zealously -1
10 zombie -1
We then go through all reviews and construct a sentiment score by looking up each word and adding up its score:
scores_df <- imdb |>
# For speed, we only take the first 100 reviews
head(100) |>
# This splits up the texts into its individual words
unnest_tokens(output = "word", input = "text", drop = FALSE) |>
# We attach the sentiment_dict to the text data.frame. inner_join drops
# rows where the word is not in both data.frames
inner_join(sentiment_dict, by = "word") |>
# For each text, we calculate the sum of values
group_by(id) |>
summarise(senti_score = sum(value),
text = head(text, 1))
head(scores_df)
# A tibble: 6 × 3
id senti_score text
<int> <dbl> <chr>
1 1 3 "Once again Mr. Costner has dragged out a movie for far lon…
2 2 1 "This is a pale imitation of 'Officer and a Gentleman.' The…
3 3 16 "Years ago, when DARLING LILI played on TV, it was always t…
4 4 2 "I was looking forward to this movie. Trustworthy actors, i…
5 5 9 "First of all, I would like to say that I am a fan of all o…
6 6 5 "This is an example of why the majority of action films are…
More commonly, people normalize the absolute count of positive and negative words to construct a score:
scores_df <- imdb |>
# For speed, we only take the first 100 reviews
head(100) |>
unnest_tokens(output = "token", input = "text", drop = FALSE) |>
inner_join(sentiment_dict, by = c("token" = "word")) |>
group_by(id) |>
# here, we normalise the outcome and assign a sentiment category
summarise(senti_score = sum(value) / n(),
sentiment = ifelse(senti_score >= 0, "pos", "neg"),
text = head(text, 1))
head(scores_df)
# A tibble: 6 × 4
id senti_score sentiment text
<int> <dbl> <chr> <chr>
1 1 0.333 pos "Once again Mr. Costner has dragged out a movie f…
2 2 0.0667 pos "This is a pale imitation of 'Officer and a Gentl…
3 3 0.222 pos "Years ago, when DARLING LILI played on TV, it wa…
4 4 0.125 pos "I was looking forward to this movie. Trustworthy…
5 5 0.429 pos "First of all, I would like to say that I am a fa…
6 6 0.263 pos "This is an example of why the majority of action…
We can plot these results to get an impression of how often each category was predicted and how strong the senti_score was in those cases.
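A minimal sketch of such a plot (the original chunk is not shown here; the boxplot geometry is an assumption):
scores_df |>
  ggplot(aes(x = sentiment, y = senti_score, fill = sentiment)) +
  # distribution of the normalised score within each predicted category
  geom_boxplot(show.legend = FALSE)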
We can validate this approach by comparing the measured sentiment to the real sentiment, as given in the dataset:
validation_df <- scores_df |>
select(-text) |>
left_join(imdb, by = "id")
# have a look at the new data.frame
validation_df |>
select(id, label, sentiment, senti_score)
# A tibble: 100 × 4
id label sentiment senti_score
<int> <fct> <chr> <dbl>
1 1 neg pos 0.333
2 2 neg pos 0.0667
3 3 neg pos 0.222
4 4 neg pos 0.125
5 5 neg pos 0.429
6 6 neg pos 0.263
7 7 neg neg -0.167
8 8 neg pos 0.217
9 9 neg pos 0.133
10 10 neg neg -0.4
# ℹ 90 more rows
An easy way to validate performance is to calculate how often the prediction and the real sentiment match:
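The counting itself is not shown above; it could look like this sketch (mirroring the code used further down):
validation_df |>
  # TRUE where the dictionary prediction agrees with the real label
  count(match = label == sentiment)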
# A tibble: 2 × 2
match n
<lgl> <int>
1 FALSE 54
2 TRUE 46
However, the absolute count of matches, or accuracy, is prone to errors, since it does not take chance into account. For example, by taking only the first 100 reviews, we happen to have gathered data that contains only negative cases:
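A quick check of the label distribution in this subset (a sketch; the original chunk is not shown) confirms this:
imdb |>
  head(100) |>
  # count how many reviews carry each real label
  count(label)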
So while optimising our mapping function we could accidentally make a wrong adjustment:
scores_df_error <- imdb |>
head(100) |>
unnest_tokens(output = "token", input = "text", drop = FALSE) |>
inner_join(sentiment_dict, by = c("token" = "word")) |>
group_by(id) |>
summarise(senti_score = sum(value) / n(),
sentiment = ifelse(senti_score >= 0, "neg", "neg"),
# see the error here --^
text = head(text, 1))
scores_df_error |>
select(-text) |>
left_join(imdb, by = "id") |>
count(match = label == sentiment)
# A tibble: 1 × 2
match n
<lgl> <int>
1 TRUE 100
Now suddenly our accuracy is perfect! This is why we use a couple of metrics that control for chance:
conf_matrix <- table(validation_df$label, validation_df$sentiment)
ml_metrics <- metric_set(accuracy, precision, recall, f_meas, kap)
conf_matrix
neg pos
neg 46 54
pos 0 0
# A tibble: 5 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.46
2 precision binary 0.46
3 recall binary 1
4 f_meas binary 0.630
5 kap binary 0
Precision: This is the ratio of correctly predicted positive observations to the total predicted positives. High precision means that an algorithm returned more relevant results than irrelevant ones.
Recall (Sensitivity): This is the ratio of correctly predicted positive observations to all observations in the actual class. High recall means that an algorithm returned most of the relevant results.
F1 Score: This is the harmonic mean of Precision and Recall. It tries to find the balance between precision and recall. A high F1 score means that both recall and precision are high.
We often use a confusion matrix to calculate these metrics. A confusion matrix is a 2x2 table that summarises the four possible outcomes of a binary classifier. The terminology can vary, but it is often formatted as follows:
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
Based on this matrix:
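$$
\text{Precision} = \frac{TP}{TP + FP} \qquad
\text{Recall} = \frac{TP}{TP + FN} \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$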
Let’s say we have a binary classifier that’s being used to predict whether a given email is “spam” (positive class) or “not spam” (negative class). Suppose our classifier is tested on a dataset of 100 emails, which include 20 actual spam messages and 80 actual non-spam messages. The classifier outputs the following results:
| | Actual Spam | Actual Not Spam |
|---|---|---|
| Predicted Spam | 15 (TP) | 5 (FP) |
| Predicted Not Spam | 5 (FN) | 75 (TN) |
Here, the precision, recall, and F1 score would be calculated as:
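$$
\text{Precision} = \frac{15}{15 + 5} = 0.75 \qquad
\text{Recall} = \frac{15}{15 + 5} = 0.75 \qquad
F_1 = 2 \cdot \frac{0.75 \cdot 0.75}{0.75 + 0.75} = 0.75
$$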
So in this case, the classifier has a precision of 0.75 (meaning that 75% of the emails it labeled as "spam" were actually spam), a recall of 0.75 (meaning that it correctly identified 75% of the total spam emails), and an F1 score of 0.75 (balancing precision and recall). Which of these metrics is most useful depends on the task. If you detect cancer in patients, false positives are not great, but false negatives might be deadly! In this case you should optimise recall and only look at the other metrics as an afterthought. Most of the time, however, we want a good F1 value (a rule of thumb is above 0.7, or better, above 0.8). Note that in most cases we do not really care about a "positive" class, as both classes are equally important. And often we have more than 2 classes. A good strategy here is to calculate F1 for each class and report all values plus their average.
Now that you know about dictionaries, remember to apply them only under certain circumstances:
We proceed in 4 (or 5) steps:
Several frameworks in R:
- RTextTools: includes the most important algorithms, but outdated
- caret: includes a huge variety of models and comes with robust helper functions, but superseded
- quanteda: nice suite of text analysis functions, including quanteda.textmodels
- tidymodels: successor of caret with an extensive collection of algorithms, preprocessing, validation and comparison functions; uses textrecipes to turn text into features
- textrecipes (an extension of the recipes package): handles all preprocessing in a unified syntax

Define recipe:
library(tidymodels)
library(textrecipes)
imdb_rec <- recipe(label ~ text, data = imdb) |>
# turn text into features/tokens/words
step_tokenize(all_predictors()) |>
# remove stopwords
step_stopwords(all_predictors(), language = "en") |>
step_tokenfilter(all_predictors(), min_times = 3) |>
step_tf(all_predictors())

Bake recipe with ingredients (data):
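A sketch of how the recipe could be prepped and baked on the first few documents to produce the table below (the exact call is not shown here):
imdb_rec |>
  # estimate the recipe steps on a small subset of the data
  prep(training = head(imdb, 10)) |>
  # return the processed training data
  bake(new_data = NULL)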
# A tibble: 10 × 101
label tf_text_act tf_text_acting tf_text_action tf_text_actors tf_text_almost
<fct> <int> <int> <int> <int> <int>
1 neg 0 0 0 0 0
2 neg 0 0 1 0 0
3 neg 1 0 2 0 1
4 neg 0 0 0 1 1
5 neg 0 0 0 1 1
6 neg 0 3 2 0 1
7 neg 2 1 0 2 0
8 neg 0 2 1 0 0
9 neg 0 0 1 0 0
10 neg 0 0 0 0 0
# ℹ 95 more variables: tf_text_also <int>, tf_text_always <int>,
# tf_text_andrews <int>, tf_text_around <int>, tf_text_back <int>,
# tf_text_badly <int>, tf_text_best <int>, tf_text_better <int>,
# tf_text_bill <int>, tf_text_blake <int>, tf_text_box <int>,
# tf_text_br <int>, tf_text_brass <int>, tf_text_can <int>,
# tf_text_care <int>, tf_text_character <int>, tf_text_characters <int>,
# tf_text_comes <int>, tf_text_costner <int>, tf_text_created <int>, …
Note
Stopwords are words that are thought to add no meaning to a text and can be safely removed. Many text analysis packages come with their own selection, but whether it’s useful depends on context! You can look at stopwords::stopwords() to see one selection.
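For example:
head(stopwords::stopwords("en"))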
Important
Making the split only once is not good practice, as which texts end up in the training and test portion of the data can influence the performance metrics of the model. Some researchers have even been caught trying many different seeds to find one that delivers better-looking results. What you should do instead is use bootstraps() to fit and evaluate the model on multiple different resamples. Check out one of Julia Silge's videos if you want to learn more.
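For reference, the imdb_train and imdb_test objects used below could be created along these lines (a sketch; the proportion is inferred from the test-set size, while the seed and stratification are assumptions):
# split the data into a training and a test portion
set.seed(1)
imdb_split <- initial_split(imdb, prop = 0.75, strata = label)
imdb_train <- training(imdb_split)
imdb_test <- testing(imdb_split)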
Let's start with a naïve Bayes model, for which we need another package that contains this "engine" (you don't need to know this; the next step would tell you to load the package):
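A sketch of what this specification could look like (mirroring the lasso example further down; the discrim package provides the naive_Bayes() model for tidymodels, and the "naivebayes" engine additionally requires the naivebayes package to be installed):
# choosing the model
library(discrim)
nb_spec <- naive_Bayes() |>
  set_mode("classification") |>
  set_engine("naivebayes")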
Now we bring the recipe and model together in a new workflow:
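Presumably along these lines (the workflow name imdb_wf_nb is chosen here for illustration):
# bring preprocessing and model together in a workflow
imdb_wf_nb <- workflow() |>
  add_recipe(imdb_rec) |>
  add_model(nb_spec)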
Now, we fit the model:
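A sketch, assuming the training portion created above:
# train the model
model_nb <- fit(imdb_wf_nb, data = imdb_train)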
imdb_prediction <- imdb_test |>
bind_cols(predict(model_nb, new_data = imdb_test)) |>
rename(truth = label, estimate = .pred_class)
conf_mat(imdb_prediction, truth, estimate)
          Truth
Prediction neg pos
neg 5861 4142
pos 389 2108
library(gt)
my_metrics <- metric_set(accuracy, kap, precision, recall, f_meas)
my_metrics(imdb_prediction, truth = truth, estimate = estimate) |>
# I use gt to make the table look a bit nicer, but it's optional
gt() |>
data_color(
columns = .estimate,
fn = scales::col_numeric(
palette = c("red", "orange", "green"),
domain = c(0, 1)
)
)

| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.6375200 |
| kap | binary | 0.2750400 |
| precision | binary | 0.5859242 |
| recall | binary | 0.9377600 |
| f_meas | binary | 0.7212207 |
Let’s walk through the steps again to train a logistic regression model:
# choosing the model
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) |>
set_mode("classification") |>
set_engine("glmnet")
# bring preprocessing and model together in a workflow
imdb_wf_lasso <- workflow() |>
add_recipe(imdb_rec) |>
add_model(lasso_spec)
# train the model
model_lasso <- fit(imdb_wf_lasso, data = imdb_train)
# evaluate
imdb_prediction_lasso <- imdb_test |>
bind_cols(predict(model_lasso, new_data = imdb_test)) |>
rename(truth = label, estimate = .pred_class)
my_metrics <- metric_set(accuracy, kap, precision, recall, f_meas)
my_metrics(imdb_prediction_lasso, truth = truth, estimate = estimate) |>
gt() |>
data_color(
columns = .estimate,
fn = scales::col_numeric(
palette = c("red", "orange", "green"),
domain = c(0, 1)
)
)

| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.7252800 |
| kap | binary | 0.4505600 |
| precision | binary | 0.7390492 |
| recall | binary | 0.6964800 |
| f_meas | binary | 0.7171334 |
A way to make sense of a model is to look at its coefficients. These tell us, essentially, which terms the model considers important for deciding that a document is positive or negative.
model_lasso |>
extract_fit_engine() |>
vip::vi() |>
group_by(Sign) |>
slice_max(Importance, n = 20) |>
ungroup() |>
mutate(
Variable = str_remove(Variable, "tf_text_"),
Variable = fct_reorder(Variable, Importance)
) |>
ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Sign, scales = "free_y") +
labs(y = NULL)

A good first step towards understanding what topic models are and how they can be useful is to simply play around with them, so that's what we'll do first. Open the page https://lettier.com/projects/lda-topic-modeling/ (if possible, in Firefox).
Example Docs:
On top, the texts are turned into a document feature matrix.
What LDA does (essentially):
The first matrix describes the probability a feature belongs to a topic. We call this the feature-topic-matrix:
This case makes it pretty clear:
The second table does the same, but with documents. We call this the document-topic-matrix:
What makes LDA often seem magical is how well it works for text. This is because the underlying assumptions fit the statistical features of document collections quite well, leading to meaningful and interpretable categories that make it easy to explore and summarise what is happening in text.
The short example above shows the three broad steps of topic modelling:
Before we go on, here are a few things I want you to try:
We use a subset of the Parlspeech corpus (Rauh and Schwalbach 2020), spanning the 18th legislative period of the Bundestag.
# Again, you can ignore this part where I download and process the data. The code is
# not really applicable to other datasets. But I left it in here in case you
# find it interesting.
data_file <- "data/bundestag18_speeches.rds"
if (!file.exists(data_file)) {
library(dataverse)
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
ds <- get_dataset("doi:10.7910/DVN/L4OAKN")
as_tibble(ds[["files"]][, c("label", "id")])
url <- get_file(3758785,
dataset = "doi:10.7910/DVN/ZY3RV7",
return_url = TRUE)
curl::curl_download(url, destfile = "data/Corp_Bundestag_V2.rds", quiet = FALSE)
bundestag18 <- readRDS("data/Corp_Bundestag_V2.rds") |>
mutate(date = ymd(date),
speechnumber = as.integer(speechnumber)) |>
filter(date >= "2013-10-22",
date <= "2017-10-24") |>
mutate(doc_id = paste0(date, "-", speechnumber)) |>
as_tibble()
saveRDS(bundestag18, data_file)
} else {
bundestag18 <- readRDS(data_file)
}

We first tidy the documents:
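The tidying chunk is not shown above; a minimal sketch using tidytext (lowercasing is the unnest_tokens() default):
# one row per token, keeping the document ID for later joins
bundestag18_tidy <- bundestag18 |>
  select(doc_id, text) |>
  unnest_tokens(output = "word", input = "text")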
Secondly, we do some light cleaning:
bundestag18_tidy_clean <- bundestag18_tidy |>
filter(!word %in% c(stopwords::stopwords(language = "de"),
"dass", "kollege", "kollegin", "herr", "frau", "dr")) |>
group_by(word) |>
filter(
n() > 10, # keep features that appear more than 10 times
!str_detect(word, "[^a-z]") # keep features that consist only of (lowercase) letters
) |>
ungroup()
print(glue::glue(
"Cleaning removed {length(unique(bundestag18_tidy$word)) - length(unique(bundestag18_tidy_clean$word))} ",
"of the {length(unique(bundestag18_tidy$word))} unique words and ",
"{length(unique(bundestag18_tidy$doc_id)) - length(unique(bundestag18_tidy_clean$doc_id))} ",
"of the {length(unique(bundestag18_tidy$doc_id))} documents."
))
Cleaning removed 197067 of the 224188 unique words and 986 of the 52784 documents.
Now we can create a document-feature-matrix. This is a (sparse) matrix showing how often each term (column) occurs in each document (row):
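A sketch of this step (the original chunk is not shown), counting words per document and casting the counts into a quanteda dfm with tidytext:
bundestag18_dfm <- bundestag18_tidy_clean |>
  # how often does each word occur in each document?
  count(doc_id, word) |>
  # cast the long counts into a sparse document-feature matrix
  cast_dfm(doc_id, word, n)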
We can inspect a corner of the dfm by casting it to a regular (dense) matrix:
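For example (a sketch):
as.matrix(bundestag18_dfm)[1:5, 1:5]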
- topicmodels or lda or stm package; stay away from mallet; try BERTopic if you know a little Python
- we use the textmodel_lda() function from the seededlda package due to the clean syntax

library(seededlda)
k <- 10
set.seed(2024) # Note, that we use `set.seed` to create reproducible results
lda_model <- textmodel_lda(
bundestag18_dfm,
k = k, # the number of topics here is chosen arbitrarily for demonstration purposes
max_iter = 200L, # I would not usually recommend so few iterations; it's just so it runs quicker here
alpha = 50L / k, # these are the default values in the package
beta = 0.1,
verbose = TRUE
)
Fitting LDA with 10 topics
...initializing
...Gibbs sampling in 200 iterations
......iteration 100 elapsed time: 37.83 seconds (delta: 0.09%)
......iteration 200 elapsed time: 72.02 seconds (delta: -0.00%)
...computing theta and phi
...complete
bundestag18_ftm <- lda_model$phi |>
as.data.frame() |> # converting the matrix into a data.frame makes sure it plays nicely with the tidyverse
rownames_to_column("topic") |> # the topic names/numbers are stored in the row.names, I move them to a column
mutate(topic = fct_inorder(topic)) |> # turn to factor to maintain the correct order
pivot_longer(-topic, names_to = "word", values_to = "phi") # turn to long format for plotting
topic_topwords_plot <- bundestag18_ftm |>
group_by(topic) |> # using group_by and slice_max, we keep only the top 15 values from each topic
slice_max(order_by = phi, n = 15) |>
# using reorder_within does some magic for a nicer plot
mutate(word = tidytext::reorder_within(word, by = phi, within = topic)) |>
# from here on, we just make a bar plot with facets
ggplot(aes(x = phi, y = word, fill = topic)) +
geom_col() +
tidytext::scale_y_reordered() +
facet_wrap(~topic, ncol = 2, scales = "free_y")
topic_topwords_plot

Going forward, I would now name these topics. I found this particular format in an Excel sheet helpful.
lda_model$phi |>
as.data.frame() |>
rowid_to_column("topic") |>
pivot_longer(-topic, names_to = "word", values_to = "phi") |>
group_by(topic) |>
slice_max(order_by = phi, n = 20) |>
mutate(top = row_number()) |>
pivot_wider(id_cols = top, names_from = topic, values_from = word) |>
# Add an extra row where you can write in topic names
add_row(top = NA, .before = 1) %>%
rio::export("topicsmodel_topwords.xlsx")

Similarly to above, we can also extract the topics per document:
bundestag18_dtm <- lda_model$theta |>
as.data.frame() |>
rownames_to_column("doc_id") |>
as_tibble()
bundestag18_dtm
# A tibble: 51,798 × 11
doc_id topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-… 0.0936 0.171 0.355 0.0195 0.0890 0.0667 0.0371 0.0306 0.103 0.0343
2 2013-… 0.260 0.0632 0.123 0.0421 0.0632 0.0246 0.0772 0.214 0.0421 0.0912
3 2013-… 0.0980 0.0980 0.0980 0.0980 0.0980 0.118 0.0980 0.0980 0.0980 0.0980
4 2013-… 0.112 0.0875 0.112 0.075 0.075 0.0625 0.138 0.15 0.1 0.0875
5 2013-… 0.0909 0.0909 0.109 0.0909 0.0909 0.127 0.0909 0.0909 0.0909 0.127
6 2013-… 0.255 0.0691 0.112 0.0372 0.0638 0.0426 0.122 0.0904 0.0585 0.149
7 2013-… 0.0943 0.0943 0.0943 0.0943 0.0943 0.113 0.0943 0.0943 0.0943 0.132
8 2013-… 0.257 0.0616 0.0978 0.0290 0.0435 0.0399 0.0833 0.207 0.0652 0.116
9 2013-… 0.0865 0.0602 0.0564 0.0263 0.0338 0.0677 0.0526 0.113 0.0376 0.466
10 2013-… 0.0841 0.0561 0.0748 0.0654 0.0841 0.0841 0.0841 0.112 0.0841 0.271
# ℹ 51,788 more rows
We can tidy this and join the results back with the original metadata using the doc_id:
bundestag18_dtm_tidy <- bundestag18_dtm |>
pivot_longer(-doc_id, names_to = "topic", values_to = "theta") |>
# again, this is to keep track of the order, as it is otherwise ordered alphabetically
mutate(topic = fct_inorder(topic)) |>
left_join(bundestag18 |> select(-text), by = "doc_id")
bundestag18_dtm_tidy
# A tibble: 517,980 × 13
doc_id topic theta date agenda speechnumber speaker party
<chr> <fct> <dbl> <date> <chr> <int> <chr> <chr>
1 2013-10-22-1 topic1 0.0936 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
2 2013-10-22-1 topic2 0.171 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
3 2013-10-22-1 topic3 0.355 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
4 2013-10-22-1 topic4 0.0195 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
5 2013-10-22-1 topic5 0.0890 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
6 2013-10-22-1 topic6 0.0667 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
7 2013-10-22-1 topic7 0.0371 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
8 2013-10-22-1 topic8 0.0306 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
9 2013-10-22-1 topic9 0.103 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
10 2013-10-22-1 topic10 0.0343 2013-10-22 SITZUNGSBE… 1 Heinz … <NA>
# ℹ 517,970 more rows
# ℹ 5 more variables: party.facts.id <dbl>, chair <lgl>, terms <dbl>,
# parliament <chr>, iso3country <chr>
Now, we can e.g. compare topic usage per party:
bundestag18_dtm_tidy |>
filter(!is.na(party),
party != "independent") |>
group_by(party, topic) |>
summarize(theta = mean(theta)) %>%
ggplot(aes(x = theta, y = topic, fill = party)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c(
"PDS/LINKE" = "#BD3075",
"SPD" = "#D71F1D",
"GRUENE" = "#78BC1B",
"CDU/CSU" = "#121212",
"FDP" = "#FFCC00",
"AfD" = "#4176C2"
))

Or over time:
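A sketch of such a plot (the aggregation to monthly averages is an assumption, not taken from the original):
bundestag18_dtm_tidy |>
  # floor_date() comes from lubridate, which is already used above
  mutate(month = floor_date(date, "month")) |>
  group_by(month, topic) |>
  summarize(theta = mean(theta), .groups = "drop") |>
  ggplot(aes(x = month, y = theta, color = topic)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~topic, ncol = 2)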
A popular way of exploring topics and how they overlap is the LDAvis package.
LDAvis helps users understand the relationships between topics and the key terms that define them.
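A sketch of how LDAvis could be fed from the seededlda output (the argument mapping is an assumption and not taken from the original; vocabulary and document order are assumed to match between the model and the dfm):
library(LDAvis)
json <- createJSON(
  phi = lda_model$phi,                                  # topic-word probabilities
  theta = lda_model$theta,                              # document-topic probabilities
  doc.length = quanteda::ntoken(bundestag18_dfm),       # tokens per document
  vocab = quanteda::featnames(bundestag18_dfm),         # vocabulary
  term.frequency = quanteda::featfreq(bundestag18_dfm)  # overall term frequencies
)
serVis(json)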
We can check how the probabilities for documents are calculated by looking at the words and topic probabilities in their original context, using the tokenbrowser package developed by Kasper Welbers.
We first select the 2000 tokens with the highest phi value, i.e., the ones which most clearly belong to one topic.
Then we attach the categories to the original tidy representation of the texts.
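A sketch of these two steps (the original chunks are not shown; the object names are chosen for illustration):
# 1. keep the 2,000 word-topic pairs with the highest phi values
top_tokens <- bundestag18_ftm |>
  slice_max(order_by = phi, n = 2000)
# 2. attach these clear-cut topic assignments to the tidy token data
bundestag18_annotated <- bundestag18_tidy |>
  left_join(select(top_tokens, word, topic), by = "word")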
Now we can look at the words that clearly belong to a topic in context of the full speeches.
library(furrr)
lda_models <- "data/lda_models.rds"
# since this takes a long time, I save the results in case I want to run it again
if (!file.exists(lda_models)) {
plan(multisession) ## for parallel processing
lda_fun <- function(k, max_iter = 200) {
textmodel_lda(
bundestag18_dfm,
k = k,
max_iter = max_iter,
alpha = 50 / k,
beta = 0.1,
verbose = TRUE
)
}
models_df <- tibble(
k = c(10:20),
model = future_map(k, lda_fun, .options = furrr_options(seed = 1))
)
saveRDS(models_df, lda_models)
} else {
models_df <- readRDS(lda_models)
}

There is no official function in seededlda to evaluate different models. The stm package is much better here, as demonstrated by Julia Silge. But since I did not want to introduce another package, I copied the functions that are currently being discussed in this issue on GitHub.
semantic_coherence <- function(model, top_n = 10) {
h <- apply(terms(model, top_n), 2, function(y) {
d <- model$data[,y]
e <- Matrix::Matrix(docfreq(d), nrow = nfeat(d), ncol = nfeat(d))
f <- fcm(d, count = "boolean") + 1
g <- Matrix::band(log(f / e), 1, ncol(f))
sum(g)
})
sum(h)
}
divergence <- function(model) {
div <- proxyC::dist(model$phi, method = "kullback")
diag(div) <- NA
mean(as.matrix(div), na.rm = TRUE)
}
# this one is taken from stm https://github.com/bstewart/stm/blob/master/R/exclusivity.R
exclusivity <- function(model, top_n = 10, frexw = 0.7) {
tphi <- t(exp(model$phi))
s <- rowSums(tphi)
mat <- tphi / s # normed by columns of beta now.
ex <- apply(mat, 2, rank) / nrow(mat)
fr <- apply(tphi, 2, rank) / nrow(mat)
frex <- 1 / (frexw / ex + (1 - frexw) / fr)
index <- apply(tphi, 2, order, decreasing = TRUE)[1:top_n, ]
out <- vector(length = ncol(tphi))
for (i in seq_len(ncol(frex))) {
out[i] <- sum(frex[index[, i], i])
}
return(mean(out))
}

We can now compute these metrics and use the following plot to evaluate the different models.
models_df_metrics <- models_df |>
mutate(semantic_coherence = map_dbl(model, semantic_coherence),
exclusivity = map_dbl(model, exclusivity),
divergence = map_dbl(model, divergence))
models_df_metrics |>
select(-model) |>
pivot_longer(-k, names_to = "metric") |>
ggplot(aes(x = k, value, color = metric)) +
geom_line(linewidth = 1.5, alpha = 0.7, show.legend = FALSE) +
scale_x_continuous(breaks = scales::pretty_breaks()) +
facet_wrap(~metric, scales = "free_y") +
labs(x = "K (number of topics)",
y = NULL,
title = "Model diagnostics by number of topics",
subtitle = "Higher = Better")

Heavily influenced by this piece: https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
CC BY-SA 4.0