Predicting the speaker of dialogue from The Office.
TL;DR: In this blog, I use LASSO logistic regression and multilevel logistic regression to predict the speaker of lines of dialogue from The Office.
What feels like forever ago, I wrote two blog posts analyzing transcripts from The Office. The first was a basic EDA of the dialogue, and the second used k-means clustering to determine types of Office episodes based on who speaks to whom. At the end of that second blog, I mentioned that I might do some predictive analysis with that data in the future. Well, it’s four months later, and I’m declaring that the future is now!
Basically, the goal here is going to be, for a given line of dialogue from the show, to predict whether it’s Michael talking or someone else. At first blush, this seems like it shouldn’t be too hard. Many of Michael’s lines are iconic (e.g. see the above gif), but I feel like this might be more a function of the delivery than the actual words themselves, and I’m curious to see how well a model (or multiple models) could predict this just from the text.
In doing this, there are a couple of things I’m particularly interested in:

- Preprocessing text data and fitting a LASSO logistic regression within the {tidymodels} ecosystem, particularly using the {textrecipes} package
- Fitting a multilevel logistic regression using the {glmmTMB} package (not part of the {tidymodels} ecosystem)

Also, before getting too much further, I want to acknowledge that I looked at this blog by Julia Silge and this blog by Emil Hvitfeldt for some background on {textrecipes}. Both are really great for people interested in text analysis.
Anyway, without much further ado, let’s get into it. As has been the case in all of my “Scrantonicity” posts, the data I’m using here comes from the {schrute} package. First, I’ll load in libraries and set some defaults/options. I’m also going to read in the data, limiting the dialogue to the first seven seasons of the show (the Michael Scott era).
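That setup code isn’t shown in this excerpt, so here’s a minimal sketch of what it might look like. The herm fill color and the exact construction (and level order) of is_mike are assumptions on my part; the dialogue itself comes from the theoffice data in {schrute}.

library(tidyverse)
library(tidymodels)
library(textrecipes)
library(schrute)
library(glmmTMB)
library(vip)

theme_set(theme_minimal())
herm <- "#27808F"  # placeholder hex color for plot fills; the original post defines its own

office <- theoffice %>%
  filter(as.numeric(season) <= 7) %>%  # Michael Scott era: seasons 1-7
  mutate(is_mike = factor(if_else(character == "Michael", "Yes", "No"),
                          levels = c("No", "Yes"))) %>%
  select(episode_name, character, text, is_mike)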
Before modeling data, I would typically do a more thorough EDA. But I’ve already explored this data pretty closely (albeit months ago) in two previous blog posts, so rather than re-doing that EDA here, I’ll just point you to those posts. One thing I will include here, though, is a quick look at the number of lines spoken by Michael Scott vs other characters, since this is the outcome I’m interested in predicting.
office %>%
count(character) %>%
top_n(10) %>%
ggplot(aes(x = n, y = fct_reorder(character, n))) +
geom_col(fill = herm) +
labs(
title = "Lines by Character",
subtitle = "First seven seasons",
y = NULL,
x = "Number of Lines"
)
So, Michael has far and away the most lines of any character. But it’ll also be useful to look at Michael vs all of the others lumped together (since this is what I’m actually predicting).
office %>%
count(is_mike) %>%
ggplot(aes(x = n, y = fct_reorder(is_mike, n))) +
geom_col(fill = herm) +
labs(
title = "Mike vs Not Mike",
y = "Is Michael?",
x = "Number of Lines"
)
Even though Michael speaks more than any other individual character, he speaks about a third as many lines as all of the other characters combined. This is relevant here because it means I’ll want to downsample when I train my model to ensure the number of observations in each class is similar, which will help the model fit.
Next, I’m going to split my data into a training and a testing set.
set.seed(0408)
office_split <- initial_split(office, strata = is_mike)
tr <- training(office_split)
te <- testing(office_split)
Now that I’ve split my data, I’m going to preprocess the data using {recipes}, {textrecipes}, and {themis} (to handle class imbalance). One thing to clarify here: I’m building a model to predict whether the speaker of a given line of dialogue is Michael. In this analysis, I want to build this model using only the text data, although there are plenty of other text-based features I could include. More specifically, I am going to handle the preprocessing such that the model I end up fitting is a bag-of-words model. This means that I want my data to include a variable for each word* (not really each word, but I’ll show later) in the transcript, each row to represent a line of dialogue, and the value in each cell to represent the tf-idf of that word. From this data structure, I can build a model where each word has an individual effect on the odds that the line is spoken by Michael, although note that this model will have no sense of word order.
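To make the tf-idf values concrete: for a word w and a line d, the classic formulation is tf-idf(w, d) = tf(w, d) × log(N / df(w)), where tf(w, d) is how often w shows up in line d, N is the total number of lines, and df(w) is the number of lines containing w. (step_tfidf() may apply a smoothed or normalized variant of this, but the intuition is the same: words that appear a lot in a given line but rarely across lines get the most weight.)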
I’ll specify this recipe and then walk through each step afterward.
office_recipe <- recipe(is_mike ~ text + episode_name, data = tr) %>%
themis::step_downsample(is_mike) %>%
step_tokenize(text) %>%
step_stopwords(text) %>%
step_tokenfilter(text, max_tokens = 200) %>%
step_tfidf(text) %>%
prep()
tr_prepped <- juice(office_recipe)
tr_prepped_noep <- tr_prepped %>%
select(-episode_name)
te_prepped <- bake(office_recipe, te)
te_prepped_noep <- te_prepped %>%
select(-episode_name)
Let’s unpack this step-by-step:

- step_downsample() will balance the data so that the number of cases where Michael is the speaker is equal to the number of cases where Michael is not the speaker. This is done by randomly dropping rows.
- step_tokenize() will take the text column in the data and isolate each word in each line.
- step_stopwords() will remove stop words (e.g. “the”, “it”, “a”) that likely won’t contain much useful information.
- step_tokenfilter(), as I’m using it here, will retain only the 200 most frequently used words. This is a pretty large number, but I’m going to fit a LASSO regression later, which can select out some of these if necessary.
- step_tfidf() calculates the term frequency-inverse document frequency, which provides a metric for how important a word is to a given document (a line of dialogue, in this case).

Another thing to note here is that I’m creating two versions of this preprocessed data for the training and test sets. The difference between “tr_prepped” and “tr_prepped_noep” (as well as their “te” counterparts) is that the “noep” versions do not have a variable identifying which episode the line came from (but are otherwise identical). This is because I don’t want to include the episode identifier in my single-level LASSO model but do want to include it in the multilevel model. I could also accomplish this by specifying a formula that doesn’t include the episode_name variable rather than creating two datasets.
Moving along! Next, I’m going to specify my model. Since I have a binary outcome (yes/no whether the speaker is Michael), I’m going to run a logistic regression. I’ll run this as a LASSO model, which will provide some feature selection and generally shrink coefficients. I’m also going to tune the model to choose the best amount of penalty.
reg_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
set_engine("glmnet")
reg_spec
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = 1
Computational engine: glmnet
Here, I’m creating 10 bootstrap resamples of my training data to help with the tuning.
set.seed(0408)
booties <- bootstraps(tr_prepped_noep, strata = is_mike, times = 10)
Now it’s time to fit the LASSO model. I’m going to add the logistic regression specification that I just created to a workflow. Along with that model specification, I’m also going to add a formula where is_mike
is regressed on all of the word features I just created. Then, I’m going to tune the model across 10 candidate values of the penalty parameter (i.e. how much regularization I’m adding).
office_wf <- workflow() %>%
add_model(reg_spec) %>%
add_formula(is_mike ~ .)
set.seed(0408)
logreg_fit <- tune_grid(
office_wf,
resamples = booties,
grid = 10
)
Great. Now that the models have been fit with various penalty values across the bootstrap resamples, I can check which penalty value is best to move forward with and finalize a model. I’m going to choose using the one-standard-error rule (which, in this case, happens to select the best model anyway). The one-standard-error rule lets me choose the most parsimonious model (here, the one with the most penalty) that is within one standard error of the best-performing model. Once I choose the best penalty value, I’ll finalize the model and refit it on the training set.
logreg_fit %>%
show_best("accuracy")
# A tibble: 5 x 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 2.27e- 3 accuracy binary 0.580 10 0.00181 Preprocessor1_Mode~
2 1.02e-10 accuracy binary 0.577 10 0.00216 Preprocessor1_Mode~
3 1.27e- 9 accuracy binary 0.577 10 0.00216 Preprocessor1_Mode~
4 7.94e- 8 accuracy binary 0.577 10 0.00216 Preprocessor1_Mode~
5 4.46e- 7 accuracy binary 0.577 10 0.00216 Preprocessor1_Mode~
best_params <- logreg_fit %>%
select_by_one_std_err(metric = "accuracy", desc(penalty))
final_logreg <- office_wf %>%
finalize_workflow(best_params) %>%
fit(data = tr_prepped_noep)
So, the best model here has an accuracy of ~58%. Not great, but better than just straight-up guessing. Remember that this is on the training set. Now, I’ll take a look at what the accuracy is on the test set.
bind_cols(
predict(final_logreg, te_prepped_noep), te_prepped_noep
) %>%
accuracy(is_mike, .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.614
61% – not bad! It’s actually a bit better than the training set accuracy, which suggests the training process didn’t lead to overfitting.
Now, I’m going to take a look at which words are the most important to predicting whether the speaker of a line of dialogue is Michael or not.
final_logreg %>%
pull_workflow_fit() %>%
vi() %>%
slice_max(order_by = abs(Importance), n = 10) %>%
ggplot(aes(x = abs(Importance), y = fct_reorder(Variable %>% str_remove("tfidf_text_"), abs(Importance)), fill = Sign)) +
geom_col() +
labs(
title = "Most Important Words Identifying Michael Scott",
subtitle = "Positive values more representative of MS, negative values more representative of others",
y = NULL
)
Not surprisingly, the word “Michael” is the strongest predictor, and it has a negative effect – if a line contains the word “Michael,” it is less likely to be spoken by Michael. Intuitively, this makes sense: other people use Michael’s name when speaking to or about him. The rest of the effects in this chart make sense to me as well (except for “mifflin” and “dunder,” which I don’t really get). Michael is certainly more likely to talk about Jan and David than other characters are, and “everybody” feels right to me too…
And the final thing I’m going to do with this logistic regression is pull out the names of the variables with non-zero coefficients. Recall that the LASSO penalty can (but doesn’t always) shrink coefficients to zero, and variables with zero coefficients have no effect on the outcome. The reason I’m doing this is that I want to fit a multilevel model next, but I’m not going to regularize that model. Instead, I’ll just specify a formula that doesn’t include the variables that got shrunk to zero in this model.
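The code that builds keep_vars isn’t shown here, but a minimal sketch (assuming the finalized LASSO workflow from above and relying on the tidy() method for glmnet fits) might look something like this:

# pull the glmnet fit out of the finalized workflow, keep the terms whose
# coefficients weren't shrunk to zero, and drop the intercept
keep_vars <- final_logreg %>%
  pull_workflow_fit() %>%
  tidy() %>%
  filter(estimate != 0, term != "(Intercept)") %>%
  pull(term)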
Now, I’m going to dive into fitting a multilevel model. To give a very brief overview of multilevel models, they are models that can take into account dependencies (nesting) within data. Recall that one of the assumptions of a linear regression is that each observation is independent. We often violate that assumption in the real world. In my work, for instance, students are often nested within classrooms (i.e. a common effect – their teacher – influences them & introduces a dependency). Another common case of nesting is when you have multiple observations over time from the same set of people. In the case of this current data, we can consider that each line is nested within an episode (terminology note: episode would be the “clustering variable” or “grouping variable” here). We could also go a step further and nest episodes within seasons to get a 3-level model rather than a 2-level model, but I’m not going to do that here.
Fitting multilevel models allows for random effects, where the coefficient of a given term differs based on the clustering variable. Any term in the model can have a random effect, but the simplest form of a multilevel model – and the one I’m going to fit here – is a random intercept model, where the value of the intercept changes depending on the clustering variable. In the current dataset, this would mean that Michael might be more (or less) likely to speak overall in a given episode (when compared to all other episodes), and so the intercept value will change to reflect that. It’s also possible to fit random slopes, where the effect of a given non-intercept term differs from episode to episode. Contextualizing that in the current data, it might mean that the word “Jan” is more (or less) associated with being spoken by Michael depending on the episode. Usually, you want a pretty clear theoretical rationale for specifying random slopes, and I don’t really have that here. Plus, it would be unreasonable to try to estimate random slopes for all of the words in the dataset (even though I only have a subset of ~190).
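To make the syntax distinction concrete, here’s a toy illustration (using a single hypothetical word feature rather than the full model) of a random intercept versus a random slope in {lme4}/{glmmTMB}-style formula notation:

# random intercept only: each episode gets its own baseline log-odds that a line is Michael's
ri_formula <- is_mike ~ tfidf_text_jan + (1 | episode_name)

# random slope (not fit in this post): the effect of "jan" is also allowed to vary by episode
rs_formula <- is_mike ~ tfidf_text_jan + (1 + tfidf_text_jan | episode_name)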
If you’re interested in learning more about multilevel models, Raudenbush & Bryk (2002) is a classic, and John Fox’s Applied Regression Analysis is just generally a really good book that has a chapter on MLMs.
Anyway – onward and upward. First, I want to specify the formula of the model. I’m going to include all of the variables that had non-zero coefficients in the lasso model earlier, and I’m also going to add a term at the end to specify the random intercept for each episode – (1 | episode_name).
glmm_formula <- as.formula(paste("is_mike ~ ", paste(keep_vars, collapse = " + "), " + (1 | episode_name)"))
I’m going to fit this model using the {glmmTMB} package, which provides an interface for fitting all sorts of generalized linear mixed models. I haven’t used this specific package before, but I have used {lme4}, which has similar syntax and can fit many of the same models. I’m going to fit the model using the training data – note that I’m not tuning anything here – and I’m specifying the binomial family because this is a logistic regression.
glmm_fit <- glmmTMB(glmm_formula, data = tr_prepped, family = binomial)
I’m going to show the summary of the model here, but it’s going to be a biiig printout since we have so many terms in the model, so feel free to scroll on by. One thing you might want to check out, though, is the summary of the variance of the intercept, which summarizes the amount of randomness in that effect.
summary(glmm_fit)
Family: binomial ( logit )
Formula:
is_mike ~ tfidf_text_michael + tfidf_text_scott + tfidf_text_friend +
tfidf_text_stanley + tfidf_text_jan + tfidf_text_everybody +
tfidf_text_business + tfidf_text_may + tfidf_text_bad + tfidf_text_pretty +
tfidf_text_say + tfidf_text_dunder + tfidf_text_god + tfidf_text_angela +
tfidf_text_scranton + tfidf_text_somebody + tfidf_text_pam +
tfidf_text_ah + tfidf_text_best + tfidf_text_going + tfidf_text_show +
tfidf_text_life + tfidf_text_guy + tfidf_text_ok + tfidf_text_thing +
tfidf_text_well + tfidf_text_alright + tfidf_text_holly +
tfidf_text_mean + tfidf_text_know + tfidf_text_okay + tfidf_text_toby +
tfidf_text_cause + tfidf_text_coming + tfidf_text_good +
tfidf_text_right + tfidf_text_thinking + tfidf_text_dwight +
tfidf_text_come + tfidf_text_yes + tfidf_text_people + tfidf_text_andy +
tfidf_text_give + tfidf_text_party + tfidf_text_ryan + tfidf_text_today +
tfidf_text_oscar + tfidf_text_fun + tfidf_text_guys + tfidf_text_everyone +
tfidf_text_see + tfidf_text_away + tfidf_text_go + tfidf_text_sure +
tfidf_text_please + tfidf_text_phyllis + tfidf_text_listen +
tfidf_text_believe + tfidf_text_told + tfidf_text_name +
tfidf_text_whole + tfidf_text_ever + tfidf_text_just + tfidf_text_money +
tfidf_text_boss + tfidf_text_ask + tfidf_text_feel + tfidf_text_find +
tfidf_text_three + tfidf_text_need + tfidf_text_made + tfidf_text_long +
tfidf_text_gonna + tfidf_text_hear + tfidf_text_friends +
tfidf_text_wow + tfidf_text_old + tfidf_text_check + tfidf_text_wait +
tfidf_text_head + tfidf_text_hold + tfidf_text_look + tfidf_text_talk +
tfidf_text_want + tfidf_text_company + tfidf_text_room +
tfidf_text_got + tfidf_text_five + tfidf_text_new + tfidf_text_mifflin +
tfidf_text_get + tfidf_text_work + tfidf_text_time + tfidf_text_every +
tfidf_text_thanks + tfidf_text_one + tfidf_text_lot + tfidf_text_mr +
tfidf_text_kevin + tfidf_text_hello + tfidf_text_thought +
tfidf_text_stop + tfidf_text_things + tfidf_text_said + tfidf_text_two +
tfidf_text_sorry + tfidf_text_never + tfidf_text_called +
tfidf_text_oh + tfidf_text_back + tfidf_text_better + tfidf_text_jim +
tfidf_text_much + tfidf_text_hi + tfidf_text_guess + tfidf_text_corporate +
tfidf_text_care + tfidf_text_day + tfidf_text_kind + tfidf_text_little +
tfidf_text_great + tfidf_text_part + tfidf_text_night + tfidf_text_fine +
tfidf_text_take + tfidf_text_put + tfidf_text_saying + tfidf_text_office +
tfidf_text_actually + tfidf_text_morning + tfidf_text_job +
tfidf_text_um + tfidf_text_last + tfidf_text_getting + tfidf_text_around +
tfidf_text_trying + tfidf_text_leave + tfidf_text_whoa +
tfidf_text_idea + tfidf_text_nothing + tfidf_text_wrong +
tfidf_text_went + tfidf_text_help + tfidf_text_first + tfidf_text_love +
tfidf_text_us + tfidf_text_even + tfidf_text_cool + tfidf_text_wanna +
tfidf_text_home + tfidf_text_anything + tfidf_text_might +
tfidf_text_everything + tfidf_text_like + tfidf_text_man +
tfidf_text_car + tfidf_text_now + tfidf_text_real + tfidf_text_paper +
tfidf_text_still + tfidf_text_second + tfidf_text_done +
tfidf_text_happy + tfidf_text_talking + tfidf_text_meet +
tfidf_text_really + tfidf_text_place + tfidf_text_something +
tfidf_text_call + tfidf_text_sales + tfidf_text_thank + tfidf_text_hot +
tfidf_text_yeah + tfidf_text_next + tfidf_text_make + tfidf_text_big +
tfidf_text_together + tfidf_text_can + tfidf_text_many +
tfidf_text_years + tfidf_text_uh + tfidf_text_think + tfidf_text_ready +
tfidf_text_manager + tfidf_text_year + tfidf_text_let + tfidf_text_else +
tfidf_text_way + tfidf_text_maybe + tfidf_text_baby + tfidf_text_probably +
tfidf_text_huh + tfidf_text_tell + tfidf_text_hey + tfidf_text_wanted +
(1 | episode_name)
Data: tr_prepped
AIC BIC logLik deviance df.resid
21739.8 23257.5 -10672.9 21345.8 16183
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
episode_name (Intercept) 0.2519 0.5019
Number of obs: 16380, groups: episode_name, 139
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.284909 0.056210 -5.069 4.01e-07 ***
tfidf_text_michael -1.365494 0.120733 -11.310 < 2e-16 ***
tfidf_text_scott 0.817744 0.174819 4.678 2.90e-06 ***
tfidf_text_friend 0.706557 0.205575 3.437 0.000588 ***
tfidf_text_stanley 0.537424 0.139759 3.845 0.000120 ***
tfidf_text_jan 0.504584 0.119162 4.234 2.29e-05 ***
tfidf_text_everybody 0.520978 0.132091 3.944 8.01e-05 ***
tfidf_text_business 0.563248 0.143005 3.939 8.19e-05 ***
tfidf_text_may 0.503039 0.173855 2.893 0.003810 **
tfidf_text_bad 0.471181 0.146416 3.218 0.001290 **
tfidf_text_pretty -0.412679 0.163791 -2.520 0.011751 *
tfidf_text_say 0.404528 0.085141 4.751 2.02e-06 ***
tfidf_text_dunder 0.556549 0.615429 0.904 0.365822
tfidf_text_god 0.352119 0.081974 4.295 1.74e-05 ***
tfidf_text_angela -0.338799 0.125293 -2.704 0.006850 **
tfidf_text_scranton 0.323568 0.130686 2.476 0.013289 *
tfidf_text_somebody 0.282980 0.121483 2.329 0.019839 *
tfidf_text_pam 0.319905 0.072147 4.434 9.25e-06 ***
tfidf_text_ah 0.309050 0.086264 3.583 0.000340 ***
tfidf_text_best 0.332800 0.139941 2.378 0.017400 *
tfidf_text_going 0.297253 0.067610 4.397 1.10e-05 ***
tfidf_text_show 0.350289 0.150457 2.328 0.019903 *
tfidf_text_life 0.309884 0.136512 2.270 0.023207 *
tfidf_text_guy 0.311064 0.110614 2.812 0.004921 **
tfidf_text_ok 0.272147 0.054382 5.004 5.60e-07 ***
tfidf_text_thing 0.270642 0.103021 2.627 0.008613 **
tfidf_text_well 0.284100 0.056668 5.013 5.35e-07 ***
tfidf_text_alright 0.281422 0.062810 4.481 7.45e-06 ***
tfidf_text_holly 0.303695 0.109919 2.763 0.005729 **
tfidf_text_mean -0.300037 0.088790 -3.379 0.000727 ***
tfidf_text_know 0.273930 0.056596 4.840 1.30e-06 ***
tfidf_text_okay 0.307069 0.044521 6.897 5.30e-12 ***
tfidf_text_toby 0.249094 0.108059 2.305 0.021157 *
tfidf_text_cause -0.248942 0.149764 -1.662 0.096469 .
tfidf_text_coming -0.290930 0.145258 -2.003 0.045193 *
tfidf_text_good 0.268924 0.057343 4.690 2.74e-06 ***
tfidf_text_right 0.259742 0.055691 4.664 3.10e-06 ***
tfidf_text_thinking 0.232674 0.121078 1.922 0.054645 .
tfidf_text_dwight 0.253907 0.063727 3.984 6.77e-05 ***
tfidf_text_come 0.203893 0.071037 2.870 0.004102 **
tfidf_text_yes 0.248846 0.040821 6.096 1.09e-09 ***
tfidf_text_people 0.222956 0.105130 2.121 0.033942 *
tfidf_text_andy -0.157914 0.089205 -1.770 0.076687 .
tfidf_text_give 0.205885 0.083517 2.465 0.013694 *
tfidf_text_party 0.194342 0.093019 2.089 0.036683 *
tfidf_text_ryan 0.188932 0.096564 1.957 0.050401 .
tfidf_text_today 0.214202 0.112770 1.899 0.057505 .
tfidf_text_oscar 0.222130 0.100434 2.212 0.026988 *
tfidf_text_fun 0.189363 0.091348 2.073 0.038174 *
tfidf_text_guys 0.165921 0.080568 2.059 0.039456 *
tfidf_text_everyone -0.168222 0.125141 -1.344 0.178862
tfidf_text_see 0.187325 0.073920 2.534 0.011271 *
tfidf_text_away 0.161108 0.143229 1.125 0.260664
tfidf_text_go 0.220001 0.061381 3.584 0.000338 ***
tfidf_text_sure -0.159186 0.074734 -2.130 0.033169 *
tfidf_text_please 0.161400 0.072772 2.218 0.026563 *
tfidf_text_phyllis 0.141596 0.095280 1.486 0.137251
tfidf_text_listen -0.134684 0.153642 -0.877 0.380700
tfidf_text_believe 0.194606 0.115799 1.681 0.092850 .
tfidf_text_told 0.122873 0.104977 1.170 0.241808
tfidf_text_name 0.186708 0.094160 1.983 0.047382 *
tfidf_text_whole -0.130371 0.136622 -0.954 0.339958
tfidf_text_ever 0.171535 0.102233 1.678 0.093369 .
tfidf_text_just 0.127548 0.064389 1.981 0.047604 *
tfidf_text_money -0.161886 0.134932 -1.200 0.230233
tfidf_text_boss 0.153626 0.151328 1.015 0.310019
tfidf_text_ask 0.123247 0.141323 0.872 0.383154
tfidf_text_feel 0.128474 0.129301 0.994 0.320418
tfidf_text_find 0.164066 0.101733 1.613 0.106806
tfidf_text_three -0.140787 0.108584 -1.297 0.194780
tfidf_text_need 0.162706 0.083187 1.956 0.050476 .
tfidf_text_made 0.123781 0.126288 0.980 0.327016
tfidf_text_long -0.117803 0.138038 -0.853 0.393434
tfidf_text_gonna -0.143375 0.076708 -1.869 0.061610 .
tfidf_text_hear 0.104088 0.102352 1.017 0.309171
tfidf_text_friends 0.084551 0.143475 0.589 0.555655
tfidf_text_wow 0.150225 0.064220 2.339 0.019324 *
tfidf_text_old 0.106273 0.116189 0.915 0.360374
tfidf_text_check 0.114413 0.098850 1.157 0.247094
tfidf_text_wait -0.126842 0.081461 -1.557 0.119448
tfidf_text_head 0.126518 0.126729 0.998 0.318118
tfidf_text_hold 0.141848 0.116016 1.223 0.221459
tfidf_text_look 0.128788 0.071782 1.794 0.072789 .
tfidf_text_talk 0.121851 0.087935 1.386 0.165843
tfidf_text_want 0.116503 0.065304 1.784 0.074422 .
tfidf_text_company 0.147109 0.130804 1.125 0.260739
tfidf_text_room 0.107632 0.107204 1.004 0.315382
tfidf_text_got -0.122130 0.067140 -1.819 0.068907 .
tfidf_text_five -0.096512 0.094074 -1.026 0.304935
tfidf_text_new -0.065516 0.123041 -0.532 0.594396
tfidf_text_mifflin -0.294953 0.601955 -0.490 0.624139
tfidf_text_get 0.126518 0.064409 1.964 0.049496 *
tfidf_text_work 0.110293 0.100668 1.096 0.273247
tfidf_text_time 0.164705 0.082474 1.997 0.045820 *
tfidf_text_every 0.140575 0.158257 0.888 0.374395
tfidf_text_thanks -0.101296 0.074035 -1.368 0.171245
tfidf_text_one -0.079519 0.073817 -1.077 0.281369
tfidf_text_lot 0.089214 0.105700 0.844 0.398651
tfidf_text_mr -0.114169 0.106592 -1.071 0.284131
tfidf_text_kevin 0.114483 0.083866 1.365 0.172233
tfidf_text_hello 0.107170 0.066524 1.611 0.107179
tfidf_text_thought -0.115780 0.096173 -1.204 0.228640
tfidf_text_stop 0.112427 0.067616 1.663 0.096364 .
tfidf_text_things 0.092322 0.123791 0.746 0.455795
tfidf_text_said 0.111146 0.068770 1.616 0.106052
tfidf_text_two -0.064680 0.094019 -0.688 0.491488
tfidf_text_sorry -0.101457 0.077747 -1.305 0.191906
tfidf_text_never 0.101368 0.087415 1.160 0.246206
tfidf_text_called -0.091985 0.127479 -0.722 0.470560
tfidf_text_oh 0.089262 0.047454 1.881 0.059966 .
tfidf_text_back -0.078854 0.087660 -0.900 0.368366
tfidf_text_better -0.060033 0.108630 -0.553 0.580509
tfidf_text_jim -0.095686 0.065332 -1.465 0.143029
tfidf_text_much 0.076737 0.087948 0.873 0.382923
tfidf_text_hi -0.088520 0.073867 -1.198 0.230775
tfidf_text_guess -0.107766 0.127102 -0.848 0.396510
tfidf_text_corporate -0.050761 0.132030 -0.384 0.700633
tfidf_text_care -0.038296 0.138086 -0.277 0.781525
tfidf_text_day 0.096423 0.104991 0.918 0.358413
tfidf_text_kind 0.071155 0.102769 0.692 0.488699
tfidf_text_little 0.124122 0.100206 1.239 0.215468
tfidf_text_great -0.073854 0.069929 -1.056 0.290912
tfidf_text_part 0.030885 0.103786 0.298 0.766023
tfidf_text_night -0.086267 0.119819 -0.720 0.471540
tfidf_text_fine -0.076178 0.084942 -0.897 0.369812
tfidf_text_take -0.055703 0.083438 -0.668 0.504392
tfidf_text_put -0.099941 0.093066 -1.074 0.282877
tfidf_text_saying -0.071405 0.108057 -0.661 0.508738
tfidf_text_office 0.066885 0.097381 0.687 0.492187
tfidf_text_actually 0.069965 0.105800 0.661 0.508421
tfidf_text_morning 0.136949 0.115285 1.188 0.234866
tfidf_text_job -0.020755 0.145005 -0.143 0.886184
tfidf_text_um 0.049639 0.073370 0.677 0.498685
tfidf_text_last -0.050689 0.122785 -0.413 0.679732
tfidf_text_getting 0.089356 0.104741 0.853 0.393598
tfidf_text_around 0.082025 0.147866 0.555 0.579081
tfidf_text_trying 0.073053 0.123007 0.594 0.552583
tfidf_text_leave -0.078307 0.105481 -0.742 0.457857
tfidf_text_whoa 0.081381 0.088434 0.920 0.357445
tfidf_text_idea -0.089929 0.091961 -0.978 0.328122
tfidf_text_nothing 0.063688 0.086331 0.738 0.460687
tfidf_text_wrong 0.086905 0.092908 0.935 0.349586
tfidf_text_went -0.028418 0.125718 -0.226 0.821166
tfidf_text_help 0.040718 0.100003 0.407 0.683888
tfidf_text_first 0.086366 0.122036 0.708 0.479127
tfidf_text_love 0.071734 0.068545 1.047 0.295317
tfidf_text_us -0.040849 0.097550 -0.419 0.675399
tfidf_text_even -0.038807 0.113579 -0.342 0.732596
tfidf_text_cool -0.049036 0.083179 -0.590 0.555514
tfidf_text_wanna -0.048928 0.105429 -0.464 0.642585
tfidf_text_home -0.048600 0.144852 -0.336 0.737238
tfidf_text_anything -0.016118 0.103091 -0.156 0.875756
tfidf_text_might -0.046693 0.133043 -0.351 0.725615
tfidf_text_everything 0.091705 0.119419 0.768 0.442529
tfidf_text_like 0.051476 0.060225 0.855 0.392698
tfidf_text_man -0.048323 0.088421 -0.547 0.584717
tfidf_text_car -0.021845 0.099341 -0.220 0.825948
tfidf_text_now -0.040680 0.084568 -0.481 0.630495
tfidf_text_real 0.052797 0.153645 0.344 0.731124
tfidf_text_paper 0.023861 0.107711 0.222 0.824683
tfidf_text_still 0.074397 0.094324 0.789 0.430265
tfidf_text_second -0.053789 0.125801 -0.428 0.668965
tfidf_text_done 0.035836 0.087322 0.410 0.681526
tfidf_text_happy 0.038974 0.088629 0.440 0.660128
tfidf_text_talking -0.024875 0.072337 -0.344 0.730941
tfidf_text_meet 0.064215 0.119743 0.536 0.591768
tfidf_text_really -0.024577 0.055026 -0.447 0.655128
tfidf_text_place 0.038420 0.114313 0.336 0.736798
tfidf_text_something 0.017752 0.099054 0.179 0.857767
tfidf_text_call -0.035360 0.080138 -0.441 0.659038
tfidf_text_sales -0.013491 0.094396 -0.143 0.886355
tfidf_text_thank 0.042518 0.053183 0.799 0.424022
tfidf_text_hot -0.003221 0.107750 -0.030 0.976151
tfidf_text_yeah -0.023379 0.039966 -0.585 0.558559
tfidf_text_next 0.009379 0.138893 0.068 0.946165
tfidf_text_make -0.009311 0.089531 -0.104 0.917168
tfidf_text_big 0.046667 0.091438 0.510 0.609791
tfidf_text_together -0.021054 0.150126 -0.140 0.888470
tfidf_text_can 0.048398 0.067522 0.717 0.473510
tfidf_text_many 0.036402 0.128211 0.284 0.776468
tfidf_text_years -0.003345 0.126027 -0.027 0.978824
tfidf_text_uh 0.033145 0.065356 0.507 0.612051
tfidf_text_think 0.039276 0.061806 0.635 0.525120
tfidf_text_ready 0.004309 0.082615 0.052 0.958406
tfidf_text_manager -0.005744 0.104443 -0.055 0.956138
tfidf_text_year -0.004601 0.122379 -0.038 0.970012
tfidf_text_let -0.010076 0.105709 -0.095 0.924062
tfidf_text_else 0.036421 0.120693 0.302 0.762831
tfidf_text_way -0.015943 0.086084 -0.185 0.853072
tfidf_text_maybe 0.014926 0.078838 0.189 0.849838
tfidf_text_baby 0.038613 0.100086 0.386 0.699647
tfidf_text_probably 0.012967 0.141661 0.092 0.927068
tfidf_text_huh 0.011606 0.088695 0.131 0.895889
tfidf_text_tell 0.025608 0.073891 0.347 0.728918
tfidf_text_hey 0.016844 0.043492 0.387 0.698544
tfidf_text_wanted -0.003270 0.119500 -0.027 0.978169
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Right, so, the next logical step in my mind is to take a closer look at the random intercepts. We see some variance in the intercept (~.25), which suggests that there are meaningful between-episode differences in how often Michael Scott speaks. Rather than looking at all of these, let’s take a look at the 10 largest effects (as a benchmark, recall that the fixed intercept is about -.3).
ranef(glmm_fit) %>%
as.data.frame() %>%
select(grp, condval) %>%
slice_max(order_by = abs(condval), n = 10) %>%
ggplot(aes(x = abs(condval), y = fct_reorder(grp, abs(condval)), fill = if_else(condval > 0, "Pos", "Neg"))) +
geom_col() +
scale_fill_discrete(name = "Sign") +
labs(
y = NULL,
title = "Top Random Intercepts"
)
This plot shows the largest (in absolute value) intercepts. The way to interpret this is that, in these episodes, Michael is more or less likely to speak overall. The effect of each word remains the same across episodes (since I didn’t specify random slopes), but these intercepts shift the assumed “base rate” at which Michael speaks. What we see here makes sense, because Michael actually isn’t in the three episodes with the largest intercepts (I should have addressed this in data cleaning – whoops!).
Finally, I’ll take a look at the accuracy of the predictions from the multilevel model.
glmm_preds_response <- predict(glmm_fit, te_prepped, type = "response")
glmm_preds <- ifelse(glmm_preds_response < .5, "No", "Yes") %>% as_factor() %>%
fct_relevel("No", "Yes")
bind_cols(te_prepped$is_mike, glmm_preds) %>%
repair_names() %>%
accuracy(truth = ...1, estimate = ...2)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.596
It’s a little bit disappointing that the multilevel model isn’t more accurate than the single-level model I ran previously, but one thing to keep in mind is that the single-level model was regularized, whereas the multilevel model wasn’t (beyond dropping the variables whose coefficients were shrunk to zero in the single-level model). So, even though our intercept seems to have a decent amount of variance – meaning random effects are probably warranted – the gains in predictive accuracy we’d get from that are more than offset by the regularization in the first model. There’s probably a way to regularize a multilevel model, but I might save that one for another day. I could also play around with changing the probability threshold for classifying a line as Michael’s by setting it to something higher than 50% (e.g. requiring a 70% predicted probability before classifying a line as spoken by Michael), but I’m not going to go down that rabbit hole here.
So, I’m going to wrap it up for now. And who knows, maybe I’ll revisit this dataset in another 4 months.