Ollama, and hence rollama, can be used for text
embedding. In short, text embedding uses what a large language model
has learned during training about the meaning of words in context
to turn text into meaningful vectors of numbers.
This technique is a powerful preprocessing step for supervised machine
learning and often increases the performance of a classification model
substantially. Compared to using rollama directly for
classification, the advantage is that converting text into embeddings
and then using these embeddings for classification is usually faster and
more resource efficient – especially if you re-use embeddings for
multiple tasks.
library(rollama)
library(tidyverse)
reviews_df <- read_csv("https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv",
                       show_col_types = FALSE)
glimpse(reviews_df)
#> Rows: 23,486
#> Columns: 11
#> $ ...1 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ `Clothing ID` 767, 1080, 1077, 1049, 847, 1080, 858, 858, 1077, 1077, 1077,…
#> $ Age 33, 34, 60, 50, 47, 49, 39, 39, 24, 34, 53, 39, 53, 44, 50, 4…
#> $ Title NA, NA, "Some major design flaws", "My favorite buy!", "Flatt…
#> $ `Review Text` "Absolutely wonderful - silky and sexy and comfortable", "Lov…
#> $ Rating 4, 5, 3, 5, 5, 2, 5, 4, 5, 5, 3, 5, 5, 5, 3, 4, 3, 5, 5, 5, 4…
#> $ `Recommended IND` 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ `Positive Feedback Count` 0, 4, 0, 0, 6, 4, 1, 4, 0, 0, 14, 2, 2, 0, 1, 3, 2, 0, 0, 0, …
#> $ `Division Name` "Initmates", "General", "General", "General Petite", "General…
#> $ `Department Name` "Intimate", "Dresses", "Dresses", "Bottoms", "Tops", "Dresses…
#> $ `Class Name` "Intimates", "Dresses", "Dresses", "Pants", "Blouses", "Dress…
Now this is a rather big dataset, and I don’t want to stress my GPU too much, so I only select the first 500 reviews for embedding. I also process the data slightly by combining the title and review text into a single column and turning the rating into a binary variable:
reviews <- reviews_df |>
  slice_head(n = 500) |>
  rename(id = ...1) |>
  mutate(rating = factor(Rating == 5, c(TRUE, FALSE), c("5", "<5"))) |>
  mutate(full_text = paste0(ifelse(is.na(Title), "", Title), `Review Text`))
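To see what these two mutate() calls do, here is a minimal sketch on a two-row toy tibble. The values are made up for illustration and are not taken from the dataset. The last step is a variation I am assuming might be useful, not part of the pipeline above: it adds a space after the title so that title and review text do not run together.

```r
library(dplyr)

# toy data that only mimics the three columns used above (made-up values)
toy <- tibble(
  Title = c(NA, "My favorite buy!"),
  `Review Text` = c("Nice fabric", "Fits perfectly"),
  Rating = c(4, 5)
)

toy_processed <- toy |>
  # TRUE becomes "5" (the first factor level), FALSE becomes "<5"
  mutate(rating = factor(Rating == 5, c(TRUE, FALSE), c("5", "<5"))) |>
  # same logic as above: a missing title is replaced by an empty string
  mutate(full_text = paste0(ifelse(is.na(Title), "", Title), `Review Text`)) |>
  # hypothetical variation: separate title and text with a space
  mutate(full_text_spaced = paste0(ifelse(is.na(Title), "", paste0(Title, " ")),
                                   `Review Text`))

levels(toy_processed$rating)
toy_processed$full_text
toy_processed$full_text_spaced
```

The order of the factor levels matters later on: yardstick treats the first level ("5" here) as the event of interest by default, which is why the predicted probability column used for the ROC curve is called .pred_5.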
To turn one or multiple texts into embeddings, you can simply use embed_text:
embed_text(text = reviews$full_text[1:3])
#> # A tibble: 3 × 4,096
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim_10 dim_11 dim_12 dim_13
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1.32 2.04 1.45 -0.342 0.0834 -0.718 -4.20 3.22 -5.04 -0.466 2.27 0.0522 2.31
#> 2 1.72 -1.29 -0.428 -1.16 0.651 -2.70 -2.60 -0.615 -8.36 0.768 1.98 -2.34 0.286
#> 3 -3.17 0.0776 0.925 -4.02 2.64 -1.40 -0.113 2.18 -5.58 0.0541 3.48 0.434 1.66
#> # ℹ 4,083 more variables: dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>,
#> # dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>,
#> # dim_25 <dbl>, dim_26 <dbl>, dim_27 <dbl>, dim_28 <dbl>, dim_29 <dbl>, dim_30 <dbl>,
#> # dim_31 <dbl>, dim_32 <dbl>, dim_33 <dbl>, dim_34 <dbl>, dim_35 <dbl>, dim_36 <dbl>,
#> # dim_37 <dbl>, dim_38 <dbl>, dim_39 <dbl>, dim_40 <dbl>, dim_41 <dbl>, dim_42 <dbl>,
#> # dim_43 <dbl>, dim_44 <dbl>, dim_45 <dbl>, dim_46 <dbl>, dim_47 <dbl>, dim_48 <dbl>,
#> # dim_49 <dbl>, dim_50 <dbl>, dim_51 <dbl>, dim_52 <dbl>, dim_53 <dbl>, dim_54 <dbl>, …
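One common way to work with such vectors is to compare them with cosine similarity, since texts with similar meaning tend to end up with similar embeddings. The sketch below uses base R only. Note that cosine_similarity() is a helper I define here (it is not a rollama function), and the three short vectors are made-up stand-ins for real embeddings.

```r
# hypothetical helper: cosine similarity between two numeric vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# made-up 4-dimensional "embeddings"
# (real ones have hundreds or thousands of dimensions)
emb <- rbind(
  text_1 = c(1, 2, 0, -1),
  text_2 = c(2, 4, 0, -2),  # points in the same direction as text_1
  text_3 = c(-2, 1, 3, 0)   # orthogonal to text_1
)

# close to 1 for vectors pointing in the same direction
cosine_similarity(emb["text_1", ], emb["text_2", ])
# close to 0 for orthogonal vectors
cosine_similarity(emb["text_1", ], emb["text_3", ])
```

With real embeddings, one could convert the tibble returned by embed_text to a matrix with as.matrix() and apply the same helper to pairs of rows.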
To use this on the sample of reviews, I put the embeddings into a new
column, before unnesting the resulting data.frame. The reason behind
this is that I want to make sure the embeddings belong to the correct
review ID. I also use a different model this time: nomic-embed-text.
While models like llama3.1 are extremely powerful at
handling conversations and natural language requests, they are also
computationally intensive, and hence relatively slow. As of version
0.1.26, Ollama supports dedicated embedding models, which can
perform the task a lot faster and with fewer resources. Download the
model with pull_model("nomic-embed-text"), then we can run:
reviews_embeddings <- reviews |>
  mutate(embeddings = embed_text(text = full_text, model = "nomic-embed-text")) |>
  select(id, rating, embeddings) |>
  unnest_wider(embeddings)
#> ✔ embedded 500 texts [25s]
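The idea of keeping embeddings next to their ID can be tried out without a running Ollama server. In the sketch below, fake_embed() is a hypothetical stand-in for embed_text that returns one row of made-up numbers per text, and I use tidyr's unpack() instead of unnest_wider() to spread the resulting data.frame column. Both choices are assumptions made for illustration, not what the code above does.

```r
library(dplyr)
library(tidyr)

# hypothetical stand-in for embed_text(): one row per text and two
# made-up "dimensions", so the example runs without Ollama
fake_embed <- function(text) {
  tibble(dim_1 = nchar(text), dim_2 = nchar(text) %% 2)
}

toy <- tibble(id = c(10, 20, 30), full_text = c("a", "bb", "ccc"))

toy_embeddings <- toy |>
  # a function returning a tibble creates a data.frame column inside mutate()
  mutate(embeddings = fake_embed(full_text)) |>
  select(id, embeddings) |>
  # unpack() spreads the data.frame column into regular columns
  unpack(embeddings)

toy_embeddings
```

Because the embeddings are created inside mutate(), each row of numbers stays attached to the id of the text it was computed from, even if rows are filtered or reordered afterwards.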
The resulting data.frame contains the ID and rating alongside the 768 embedding dimensions:
reviews_embeddings
#> # A tibble: 500 × 770
#> id rating dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim_10 dim_11
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 <5 1.12 1.56 -4.48 -0.129 -0.373 0.390 -1.24 0.821 -0.0431 -0.261 -0.431
#> 2 1 5 0.792 0.721 -3.14 -0.808 -1.81 1.35 0.403 -0.377 0.358 -0.173 -0.643
#> 3 2 <5 0.539 1.12 -2.58 -0.417 -0.992 1.77 0.895 0.174 0.199 -0.359 -0.637
#> 4 3 5 -0.150 1.25 -4.12 -0.0750 -0.835 1.06 -0.0965 -0.292 0.636 -1.20 -0.188
#> 5 4 5 0.352 0.972 -3.40 -1.18 -0.686 0.489 0.127 0.473 0.259 -0.953 -0.757
#> 6 5 <5 0.907 0.975 -2.78 -0.638 -1.48 2.21 0.373 -0.448 0.395 -0.371 -0.442
#> 7 6 5 0.523 0.321 -2.46 -0.678 -0.640 0.501 0.703 0.320 0.442 0.500 -0.814
#> 8 7 <5 0.224 0.694 -3.12 -0.562 -1.50 -0.0708 0.178 0.144 0.367 0.0269 -0.755
#> 9 8 5 -0.0477 1.21 -3.70 -0.300 -0.936 0.583 0.135 -0.234 0.220 -0.384 0.512
#> 10 9 5 -0.105 1.13 -3.22 -0.310 -1.69 0.857 -0.157 -0.122 0.227 -0.513 -0.333
#> # ℹ 490 more rows
#> # ℹ 757 more variables: dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>,
#> # dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>,
#> # dim_23 <dbl>, dim_24 <dbl>, dim_25 <dbl>, dim_26 <dbl>, dim_27 <dbl>, dim_28 <dbl>,
#> # dim_29 <dbl>, dim_30 <dbl>, dim_31 <dbl>, dim_32 <dbl>, dim_33 <dbl>, dim_34 <dbl>,
#> # dim_35 <dbl>, dim_36 <dbl>, dim_37 <dbl>, dim_38 <dbl>, dim_39 <dbl>, dim_40 <dbl>,
#> # dim_41 <dbl>, dim_42 <dbl>, dim_43 <dbl>, dim_44 <dbl>, dim_45 <dbl>, dim_46 <dbl>, …
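Since embeddings can be re-used for multiple tasks, it may be worth storing them on disk instead of computing them again. A minimal sketch with base R's saveRDS() and readRDS() follows. The small data.frame is a made-up stand-in for reviews_embeddings, and the file location (a temporary file) is an assumption; in practice one would pick a permanent path.

```r
# made-up stand-in for reviews_embeddings so the example runs on its own
toy_embeddings <- data.frame(
  id = 0:2,
  rating = factor(c("<5", "5", "<5"), levels = c("5", "<5")),
  dim_1 = c(1.12, 0.79, 0.54),
  dim_2 = c(1.56, 0.72, 1.12)
)

# hypothetical cache location; a temporary file is used here
cache_file <- tempfile(fileext = ".rds")

# write the embeddings once ...
saveRDS(toy_embeddings, cache_file)

# ... and read them back in a later session or for another task
restored <- readRDS(cache_file)

# check that nothing was lost on the way
all.equal(toy_embeddings, restored)
```

The RDS format keeps column types such as the factor levels of rating intact, which a plain CSV export would not do.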
As said above, these embeddings are often used in supervised machine
learning. I use part of a
blog post by Emil Hvitfeldt to show how this can be done using the data
we embedded above in the powerful tidymodels collection of
packages:
library(tidymodels)
# split data into training and test set (for validation)
set.seed(1)
reviews_split <- initial_split(reviews_embeddings)
reviews_train <- training(reviews_split)
# set up the model we want to use
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")
# we specify that we want to do some hyperparameter tuning and bootstrapping
param_grid <- grid_regular(penalty(), levels = 50)
reviews_boot <- bootstraps(reviews_train, times = 10)
# and we define the model. Here we use the embeddings to predict the rating
rec_spec <- recipe(rating ~ ., data = select(reviews_train, -id))
# bringing this together in a workflow
wf_fh <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(lasso_spec)
# now we do the tuning
set.seed(42)
lasso_grid <- tune_grid(
  wf_fh,
  resamples = reviews_boot,
  grid = param_grid
)
# select the best model
wf_fh_final <- wf_fh |>
  finalize_workflow(parameters = select_best(lasso_grid, metric = "roc_auc"))
# and train a new model + predict the classes for the test set
final_res <- last_fit(wf_fh_final, reviews_split)
# we extract these predictions
final_pred <- final_res |>
  collect_predictions()
# and evaluate them with a few standard metrics
my_metrics <- metric_set(accuracy, precision, recall, f_meas)
my_metrics(final_pred, truth = rating, estimate = .pred_class)
#> # A tibble: 4 × 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.744
#> 2 precision binary 0.690
#> 3 recall binary 0.831
#> 4 f_meas binary 0.754
# and the ROC curve
final_pred |>
  roc_curve(rating, .pred_5) |>
  autoplot()
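To make the four metrics in the table above more concrete, they can be computed by hand in base R. The eight true and predicted classes below are made up for illustration (they are not the model's predictions), and "5" is treated as the class of interest because it is the first factor level.

```r
# made-up true and predicted classes; "5" is the first level and is
# therefore treated as the class of interest
truth    <- factor(c("5", "5", "5", "<5", "<5", "<5", "5", "<5"), levels = c("5", "<5"))
estimate <- factor(c("5", "5", "<5", "5", "<5", "5", "5", "<5"), levels = c("5", "<5"))

tp <- sum(truth == "5" & estimate == "5")   # true positives
fp <- sum(truth == "<5" & estimate == "5")  # false positives
fn <- sum(truth == "5" & estimate == "<5")  # false negatives

accuracy  <- mean(truth == estimate)        # share of correct predictions
precision <- tp / (tp + fp)                 # how many predicted "5" are really "5"
recall    <- tp / (tp + fn)                 # how many real "5" were found
f_meas    <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f_meas = f_meas)

# the same F-measure formula applied to the rounded precision and recall
# from the table above gives 0.754
round(2 * 0.690 * 0.831 / (0.690 + 0.831), 3)
```

The last line shows that the reported f_meas is simply the harmonic mean of the reported precision and recall.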