Scrantonicity - Part 3

Predicting the speaker of dialogue from The Office.

EE https://www.ericekholm.com/
08-29-2020

TL;DR: In this post, I use LASSO logistic regression and multilevel logistic regression to predict the speaker of lines of dialogue from The Office.

What feels like forever ago, I wrote two blog posts analyzing transcripts from The Office. The first was a basic EDA of the dialogue, and the second used k-means clustering to determine types of Office episodes based on who speaks to whom. At the end of that second blog, I mentioned that I might do some predictive analysis with that data in the future. Well, it’s four months later, and I’m declaring that the future is now!

Basically, the goal here is going to be, for a given line of dialogue from the show, to predict whether it’s Michael talking or someone else. At first blush, this seems like it shouldn’t be too hard. Many of Michael’s lines are iconic (e.g. see the above gif), but I feel like this might be more a function of the delivery than the actual words themselves, and I’m curious to see how well a model (or multiple models) could predict this just from the text.

In doing this, there are a couple of things I'm interested in: predicting the speaker from the text of each line alone (via a bag-of-words model), and comparing a regularized (LASSO) single-level logistic regression to a multilevel logistic regression that accounts for lines being nested within episodes.

Also, before getting too much further, I want to acknowledge that I looked at this blog by Julia Silge and this blog by Emil Hvitfeldt for some background on {textrecipes}. Both are really great for people interested in text analysis.

Anyway, without much further ado, let’s get into it. As has been the case in all of my “Scrantonicity” posts, the data I’m using here comes from the {schrute} package. First, I’ll load in libraries and set some defaults/options. I’m also going to read in the data, limiting the dialogue to the first seven seasons of the show (the Michael Scott era).

Setup
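The setup chunk itself isn't shown in this extract. As a rough sketch (the specific libraries are inferred from the code that follows, and the herm plot color and exact columns kept are my assumptions), it probably looks something like this:

# Sketch of the setup chunk (not shown in the original extract)
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(themis)
library(schrute)
library(glmmTMB)
library(vip)

herm <- "#7a5c58"  # fill color used in the plots below; the exact hex is an assumption

office <- schrute::theoffice %>%
  filter(as.numeric(season) <= 7) %>%  # Michael Scott era: seasons 1-7
  mutate(is_mike = factor(if_else(character == "Michael", "Yes", "No"))) %>%
  select(episode_name, character, text, is_mike)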

Brief EDA and Data Preprocessing

Before modeling data, I would typically do a more thorough EDA. But I've already explored this data pretty closely (albeit months ago) in two previous blog posts, so rather than redoing that EDA, I'll just point back to those posts. One thing I will include here, though, is a quick look at the number of lines spoken by Michael Scott vs other characters, since this is the outcome I'm interested in predicting.

office %>%
  count(character) %>%
  top_n(10) %>%
  ggplot(aes(x = n, y = fct_reorder(character, n))) +
  geom_col(fill = herm) +
  labs(
    title = "Lines by Character",
    subtitle = "First seven seasons",
    y = NULL,
    x = "Number of Lines"
  )

So, Michael has far and away the most lines of any character. But it’ll also be useful to look at Michael vs all of the others lumped together (since this is what I’m actually predicting).

office %>%
  count(is_mike) %>%
  ggplot(aes(x = n, y = fct_reorder(is_mike, n))) +
  geom_col(fill = herm) +
  labs(
    title = "Mike vs Not Mike",
    y = "Is Michael?",
    x = "Number of Lines"
  )

Even though Michael speaks more than any other single character, he speaks only about a third as many lines as all of the other characters combined. This matters because it means I'll want to downsample when I train my model to ensure the number of observations in each class is similar, which will help the model fit.

Data Splitting & Preprocessing

Next, I'm going to split my data into a training and a testing set.

set.seed(0408)
office_split <- initial_split(office, strata = is_mike)
tr <- training(office_split)
te <- testing(office_split)

Now that I've split my data, I'm going to preprocess it using {recipes}, {textrecipes}, and {themis} (to handle class imbalance). One thing to clarify here: I'm building a model to predict whether the speaker of a given line of dialogue is Michael, and I want to build this model using only the text data, although there are plenty of other text-based features I could include. More specifically, I'm going to handle the preprocessing such that the model I end up fitting is a bag-of-words model. This means I want my data to include a variable for each word (well, not quite each word, as I'll show in a bit) in the transcript, each row to represent a line of dialogue, and the value in each cell to represent the tf-idf score of that word for that line. From this data structure, I can build a model where each word has an individual effect on the odds that the line is spoken by Michael, though note that the model will have no sense of word order.
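To make that target structure concrete, here's a tiny toy example (made-up lines, not the actual data; it assumes the setup libraries above are loaded) showing the shape the preprocessing produces:

# Toy illustration of the bag-of-words / tf-idf structure (made-up lines)
toy <- tibble(
  speaker = factor(c("Yes", "No", "Yes")),
  text = c("that is what she said",
           "michael please stop",
           "i am the best boss")
)

recipe(speaker ~ text, data = toy) %>%
  step_tokenize(text) %>%
  step_tfidf(text) %>%
  prep() %>%
  juice()
# One row per toy line, one tfidf_text_* column per word, and each cell holds
# that word's tf-idf score for that line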

I’ll specify this recipe and then walk through each step afterward.

office_recipe <- recipe(is_mike ~ text + episode_name, data = tr) %>%
  themis::step_downsample(is_mike) %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 200) %>%
  step_tfidf(text) %>%
  prep()
tr_prepped <- juice(office_recipe)
tr_prepped_noep <- tr_prepped %>%
  select(-episode_name)
te_prepped <- bake(office_recipe, te)
te_prepped_noep <- te_prepped %>%
  select(-episode_name)

Let's unpack this step-by-step. The recipe() call declares is_mike as the outcome, with text and episode_name as predictors. themis::step_downsample() balances the classes by downsampling the majority (non-Michael) class. step_tokenize() splits each line into word tokens, step_stopwords() drops common stop words ("the," "and," etc.), and step_tokenfilter() keeps only the 200 most frequent remaining tokens, which is why the data won't literally contain a column for every word. step_tfidf() then converts those tokens into tf-idf features, one column per word. Finally, prep() estimates all of these steps from the training data, after which juice() returns the prepped training set and bake() applies the same steps to the test set.

Another thing to note here is that I'm creating two versions of the preprocessed data for both the training and test sets. The difference between "tr_prepped" and "tr_prepped_noep" (as well as their "te" counterparts) is that the "noep" versions do not have a variable identifying which episode the line came from (but are otherwise identical). This is because I don't want to include the episode identifier in my single-level LASSO model but do want to include it in the multilevel model. I could also accomplish this by specifying a model formula that excludes the episode_name variable rather than creating two datasets.
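For reference, that alternative would just mean keeping episode_name in a single prepped dataset and excluding it through the model formula when I build the workflow below; a sketch (not what I actually run here):

# Sketch: exclude the episode identifier via the formula instead of
# maintaining separate "noep" copies of the prepped data
alt_formula <- is_mike ~ . - episode_name
# ...and then add_formula(alt_formula) instead of is_mike ~ . in the workflow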

Moving along! Next, I'm going to specify my model. Since I have a binary outcome (yes/no: is the speaker Michael), I'm going to run a logistic regression. I'm going to run this as a LASSO model, which will provide some feature selection and generally shrink coefficients, and I'm going to tune the model to choose the best amount of penalty as well.

reg_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_engine("glmnet")
reg_spec
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet 

Here, I'm creating 10 bootstrap resamples of my training data to use for the tuning.

set.seed(0408)
booties <- bootstraps(tr_prepped_noep, strata = is_mike, times = 10)

LASSO Model Fitting & Examination

Now it’s time to fit the LASSO model. I’m going to add the logistic regression specification that I just created to a workflow. Along with that model specification, I’m also going to add a formula where is_mike is regressed on all of the word features I just created. Then, I’m going to tune the model across 10 candidate values of the penalty parameter (i.e. how much regularization I’m adding).

office_wf <- workflow() %>%
  add_model(reg_spec) %>%
  add_formula(is_mike ~ .)
set.seed(0408)
logreg_fit <- tune_grid(
  office_wf,
  resamples = booties,
  grid = 10
)

Great. Now that the models have been fit with various penalty values across the bootstrap resamples, I can check which penalty value performs best and finalize the model. I'm going to choose by the one-standard-error rule (which, in this case, happens to also pick the best model): it selects the most parsimonious model (here, the one with the most penalty) that is within one standard error of the best-performing model. Once I've chosen the penalty value, I'll finalize the workflow and refit on the training set.

logreg_fit %>%
  show_best("accuracy")
# A tibble: 5 x 7
   penalty .metric  .estimator  mean     n std_err .config            
     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>              
1 2.27e- 3 accuracy binary     0.580    10 0.00181 Preprocessor1_Mode~
2 1.02e-10 accuracy binary     0.577    10 0.00216 Preprocessor1_Mode~
3 1.27e- 9 accuracy binary     0.577    10 0.00216 Preprocessor1_Mode~
4 7.94e- 8 accuracy binary     0.577    10 0.00216 Preprocessor1_Mode~
5 4.46e- 7 accuracy binary     0.577    10 0.00216 Preprocessor1_Mode~
best_params <- logreg_fit %>%
  select_by_one_std_err(metric = "accuracy", desc(penalty))
final_logreg <- office_wf %>%
  finalize_workflow(best_params) %>%
  fit(data = tr_prepped_noep)

So, the best model here has an accuracy of ~58%. Not great, but better than just straight-up guessing. Remember that this is on the training set. Now, I’ll take a look at what the accuracy is on the test set.

bind_cols(
  predict(final_logreg, te_prepped_noep), te_prepped_noep
) %>%
  accuracy(is_mike, .pred_class)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.614

61% accuracy on the test set, not bad! It's actually a touch better than the training-set accuracy, which suggests the training process didn't lead to overfitting.
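Accuracy alone doesn't show where the errors fall. For a quick breakdown, a confusion matrix via {yardstick}'s conf_mat() would do it (a sketch; this isn't part of the original analysis):

# Sketch: cross-tabulate predicted vs actual classes on the test set
bind_cols(
  predict(final_logreg, te_prepped_noep), te_prepped_noep
) %>%
  conf_mat(truth = is_mike, estimate = .pred_class)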

Now, I’m going to take a look at which words are the most important to predicting whether the speaker of a line of dialogue is Michael or not.

final_logreg %>%
  pull_workflow_fit() %>%
  vi() %>%
  slice_max(order_by = abs(Importance), n = 10) %>%
  ggplot(aes(x = abs(Importance), y = fct_reorder(Variable %>% str_remove("tfidf_text_"), abs(Importance)), fill = Sign)) +
  geom_col() +
  labs(
    title = "Most Important Words Identifying Michael Scott",
    subtitle = "Positive values more representative of MS, negative values more representative of others",
    y = NULL
  )

Not surprisingly, the word “Michael” is the strongest predictor, and has a negative effect – if a line has the word “Michael” in it, it is less likely to be spoken by Michael. Intuitively, this makes sense. Other people use Michael’s name when speaking to or about him. The rest of the effects in this chart make sense to me as well (except for “mifflin” and “dunder,” which I don’t really get). But Michael is certainly more likely to talk about Jan and David than are other characters, and “everybody” feels right to me as well…

And the final thing I'm going to do with this logistic regression is pull out the names of the non-zero coefficients. Recall that the LASSO penalty can (but doesn't always) shrink coefficients to zero, and those zeroed-out variables have no effect on the outcome. I'm doing this because I want to fit a multilevel model next, but I'm not going to regularize that model; instead, I'll specify a formula that excludes the variables that got shrunk to zero here.

keep_vars <- final_logreg %>%
  pull_workflow_fit() %>%
  vi() %>%
  filter(Importance != 0) %>%
  pull(Variable)

Multilevel Model Fitting

Now, I’m going to dive into fitting a multilevel model. To give a very brief overview of multilevel models, they are models that can take into account dependencies (nesting) within data. Recall that one of the assumptions of a linear regression is that each observation is independent. We often violate that assumption in the real world. In my work, for instance, students are often nested within classrooms (i.e. a common effect – their teacher – influences them & introduces a dependency). Another common case of nesting is when you have multiple observations over time from the same set of people. In the case of this current data, we can consider that each line is nested within an episode (terminology note: episode would be the “clustering variable” or “grouping variable” here). We could also go a step further and nest episodes within seasons to get a 3-level model rather than a 2-level model, but I’m not going to do that here.

Fitting multilevel models allows for random effects, where the coefficient of a given term differs based on the clustering variable. Any term in the model can have a random effect, but the simplest form of a multilevel model – and the one I’m going to fit here – is a random intercept model, where the value of the intercept changes depending on the clustering variable. In the current dataset, this would mean that Michael might be more (or less) likely to speak overall in a given episode (when compared to all other episodes), and so the intercept value will change to reflect that. It’s also possible to fit random slopes, where the effect of a given non-intercept term differs from episode to episode. Contextualizing that in the current data, it might mean that the word “Jan” is more (or less) associated with being spoken by Michael depending on the episode. Usually, you want a pretty clear theoretical rationale for specifying random slopes, and I don’t really have that here. Plus, it would be unreasonable to try to estimate random slopes for all of the words in the dataset (even though I only have a subset of ~190).
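To put the random-intercept model in equation form (roughly), with i indexing lines, j indexing episodes, the x's being the tf-idf word features, and u_j the episode-specific shift in the intercept:

$$
\log \frac{P(\text{is\_mike}_{ij} = 1)}{1 - P(\text{is\_mike}_{ij} = 1)} = \beta_0 + u_j + \sum_k \beta_k x_{ijk}, \qquad u_j \sim N(0, \tau^2)
$$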

If you’re interested in learning more about multilevel models, Raudenbush & Bryk (2002) is a classic, and John Fox’s Applied Regression Analysis is just generally a really good book that has a chapter on MLMs.

Anyway – onward and upward. First, I want to specify the formula of the model. I’m going to include all of the variables that had non-zero coefficients in the lasso model earlier, and I’m also going to add a term at the end to specify the random intercept for each episode – (1 | episode_name).

glmm_formula <- as.formula(paste("is_mike ~ ", paste(keep_vars, collapse = " + "), " + (1 | episode_name)"))

I'm going to fit this model using the {glmmTMB} package, which provides an interface for fitting all sorts of generalized linear mixed models. I haven't used this specific package before, but I have used {lme4}, which has very similar syntax and does essentially the same job for these kinds of models. I'm going to fit the model on the training data (note that I'm not tuning anything here), and I'm specifying the binomial family because this is a logistic regression.

glmm_fit <- glmmTMB(glmm_formula, data = tr_prepped, family = binomial)

I’m going to show the summary of the model here, but it’s going to be a biiig printout since we have so many terms in the model, so feel free to scroll on by. One thing you might want to check out, though, is the summary of the variance of the intercept, which summarizes the amount of randomness in that effect.

summary(glmm_fit)
 Family: binomial  ( logit )
Formula:          
is_mike ~ tfidf_text_michael + tfidf_text_scott + tfidf_text_friend +  
    tfidf_text_stanley + tfidf_text_jan + tfidf_text_everybody +  
    tfidf_text_business + tfidf_text_may + tfidf_text_bad + tfidf_text_pretty +  
    tfidf_text_say + tfidf_text_dunder + tfidf_text_god + tfidf_text_angela +  
    tfidf_text_scranton + tfidf_text_somebody + tfidf_text_pam +  
    tfidf_text_ah + tfidf_text_best + tfidf_text_going + tfidf_text_show +  
    tfidf_text_life + tfidf_text_guy + tfidf_text_ok + tfidf_text_thing +  
    tfidf_text_well + tfidf_text_alright + tfidf_text_holly +  
    tfidf_text_mean + tfidf_text_know + tfidf_text_okay + tfidf_text_toby +  
    tfidf_text_cause + tfidf_text_coming + tfidf_text_good +  
    tfidf_text_right + tfidf_text_thinking + tfidf_text_dwight +  
    tfidf_text_come + tfidf_text_yes + tfidf_text_people + tfidf_text_andy +  
    tfidf_text_give + tfidf_text_party + tfidf_text_ryan + tfidf_text_today +  
    tfidf_text_oscar + tfidf_text_fun + tfidf_text_guys + tfidf_text_everyone +  
    tfidf_text_see + tfidf_text_away + tfidf_text_go + tfidf_text_sure +  
    tfidf_text_please + tfidf_text_phyllis + tfidf_text_listen +  
    tfidf_text_believe + tfidf_text_told + tfidf_text_name +  
    tfidf_text_whole + tfidf_text_ever + tfidf_text_just + tfidf_text_money +  
    tfidf_text_boss + tfidf_text_ask + tfidf_text_feel + tfidf_text_find +  
    tfidf_text_three + tfidf_text_need + tfidf_text_made + tfidf_text_long +  
    tfidf_text_gonna + tfidf_text_hear + tfidf_text_friends +  
    tfidf_text_wow + tfidf_text_old + tfidf_text_check + tfidf_text_wait +  
    tfidf_text_head + tfidf_text_hold + tfidf_text_look + tfidf_text_talk +  
    tfidf_text_want + tfidf_text_company + tfidf_text_room +  
    tfidf_text_got + tfidf_text_five + tfidf_text_new + tfidf_text_mifflin +  
    tfidf_text_get + tfidf_text_work + tfidf_text_time + tfidf_text_every +  
    tfidf_text_thanks + tfidf_text_one + tfidf_text_lot + tfidf_text_mr +  
    tfidf_text_kevin + tfidf_text_hello + tfidf_text_thought +  
    tfidf_text_stop + tfidf_text_things + tfidf_text_said + tfidf_text_two +  
    tfidf_text_sorry + tfidf_text_never + tfidf_text_called +  
    tfidf_text_oh + tfidf_text_back + tfidf_text_better + tfidf_text_jim +  
    tfidf_text_much + tfidf_text_hi + tfidf_text_guess + tfidf_text_corporate +  
    tfidf_text_care + tfidf_text_day + tfidf_text_kind + tfidf_text_little +  
    tfidf_text_great + tfidf_text_part + tfidf_text_night + tfidf_text_fine +  
    tfidf_text_take + tfidf_text_put + tfidf_text_saying + tfidf_text_office +  
    tfidf_text_actually + tfidf_text_morning + tfidf_text_job +  
    tfidf_text_um + tfidf_text_last + tfidf_text_getting + tfidf_text_around +  
    tfidf_text_trying + tfidf_text_leave + tfidf_text_whoa +  
    tfidf_text_idea + tfidf_text_nothing + tfidf_text_wrong +  
    tfidf_text_went + tfidf_text_help + tfidf_text_first + tfidf_text_love +  
    tfidf_text_us + tfidf_text_even + tfidf_text_cool + tfidf_text_wanna +  
    tfidf_text_home + tfidf_text_anything + tfidf_text_might +  
    tfidf_text_everything + tfidf_text_like + tfidf_text_man +  
    tfidf_text_car + tfidf_text_now + tfidf_text_real + tfidf_text_paper +  
    tfidf_text_still + tfidf_text_second + tfidf_text_done +  
    tfidf_text_happy + tfidf_text_talking + tfidf_text_meet +  
    tfidf_text_really + tfidf_text_place + tfidf_text_something +  
    tfidf_text_call + tfidf_text_sales + tfidf_text_thank + tfidf_text_hot +  
    tfidf_text_yeah + tfidf_text_next + tfidf_text_make + tfidf_text_big +  
    tfidf_text_together + tfidf_text_can + tfidf_text_many +  
    tfidf_text_years + tfidf_text_uh + tfidf_text_think + tfidf_text_ready +  
    tfidf_text_manager + tfidf_text_year + tfidf_text_let + tfidf_text_else +  
    tfidf_text_way + tfidf_text_maybe + tfidf_text_baby + tfidf_text_probably +  
    tfidf_text_huh + tfidf_text_tell + tfidf_text_hey + tfidf_text_wanted +  
    (1 | episode_name)
Data: tr_prepped

     AIC      BIC   logLik deviance df.resid 
 21739.8  23257.5 -10672.9  21345.8    16183 

Random effects:

Conditional model:
 Groups       Name        Variance Std.Dev.
 episode_name (Intercept) 0.2519   0.5019  
Number of obs: 16380, groups:  episode_name, 139

Conditional model:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -0.284909   0.056210  -5.069 4.01e-07 ***
tfidf_text_michael    -1.365494   0.120733 -11.310  < 2e-16 ***
tfidf_text_scott       0.817744   0.174819   4.678 2.90e-06 ***
tfidf_text_friend      0.706557   0.205575   3.437 0.000588 ***
tfidf_text_stanley     0.537424   0.139759   3.845 0.000120 ***
tfidf_text_jan         0.504584   0.119162   4.234 2.29e-05 ***
tfidf_text_everybody   0.520978   0.132091   3.944 8.01e-05 ***
tfidf_text_business    0.563248   0.143005   3.939 8.19e-05 ***
tfidf_text_may         0.503039   0.173855   2.893 0.003810 ** 
tfidf_text_bad         0.471181   0.146416   3.218 0.001290 ** 
tfidf_text_pretty     -0.412679   0.163791  -2.520 0.011751 *  
tfidf_text_say         0.404528   0.085141   4.751 2.02e-06 ***
tfidf_text_dunder      0.556549   0.615429   0.904 0.365822    
tfidf_text_god         0.352119   0.081974   4.295 1.74e-05 ***
tfidf_text_angela     -0.338799   0.125293  -2.704 0.006850 ** 
tfidf_text_scranton    0.323568   0.130686   2.476 0.013289 *  
tfidf_text_somebody    0.282980   0.121483   2.329 0.019839 *  
tfidf_text_pam         0.319905   0.072147   4.434 9.25e-06 ***
tfidf_text_ah          0.309050   0.086264   3.583 0.000340 ***
tfidf_text_best        0.332800   0.139941   2.378 0.017400 *  
tfidf_text_going       0.297253   0.067610   4.397 1.10e-05 ***
tfidf_text_show        0.350289   0.150457   2.328 0.019903 *  
tfidf_text_life        0.309884   0.136512   2.270 0.023207 *  
tfidf_text_guy         0.311064   0.110614   2.812 0.004921 ** 
tfidf_text_ok          0.272147   0.054382   5.004 5.60e-07 ***
tfidf_text_thing       0.270642   0.103021   2.627 0.008613 ** 
tfidf_text_well        0.284100   0.056668   5.013 5.35e-07 ***
tfidf_text_alright     0.281422   0.062810   4.481 7.45e-06 ***
tfidf_text_holly       0.303695   0.109919   2.763 0.005729 ** 
tfidf_text_mean       -0.300037   0.088790  -3.379 0.000727 ***
tfidf_text_know        0.273930   0.056596   4.840 1.30e-06 ***
tfidf_text_okay        0.307069   0.044521   6.897 5.30e-12 ***
tfidf_text_toby        0.249094   0.108059   2.305 0.021157 *  
tfidf_text_cause      -0.248942   0.149764  -1.662 0.096469 .  
tfidf_text_coming     -0.290930   0.145258  -2.003 0.045193 *  
tfidf_text_good        0.268924   0.057343   4.690 2.74e-06 ***
tfidf_text_right       0.259742   0.055691   4.664 3.10e-06 ***
tfidf_text_thinking    0.232674   0.121078   1.922 0.054645 .  
tfidf_text_dwight      0.253907   0.063727   3.984 6.77e-05 ***
tfidf_text_come        0.203893   0.071037   2.870 0.004102 ** 
tfidf_text_yes         0.248846   0.040821   6.096 1.09e-09 ***
tfidf_text_people      0.222956   0.105130   2.121 0.033942 *  
tfidf_text_andy       -0.157914   0.089205  -1.770 0.076687 .  
tfidf_text_give        0.205885   0.083517   2.465 0.013694 *  
tfidf_text_party       0.194342   0.093019   2.089 0.036683 *  
tfidf_text_ryan        0.188932   0.096564   1.957 0.050401 .  
tfidf_text_today       0.214202   0.112770   1.899 0.057505 .  
tfidf_text_oscar       0.222130   0.100434   2.212 0.026988 *  
tfidf_text_fun         0.189363   0.091348   2.073 0.038174 *  
tfidf_text_guys        0.165921   0.080568   2.059 0.039456 *  
tfidf_text_everyone   -0.168222   0.125141  -1.344 0.178862    
tfidf_text_see         0.187325   0.073920   2.534 0.011271 *  
tfidf_text_away        0.161108   0.143229   1.125 0.260664    
tfidf_text_go          0.220001   0.061381   3.584 0.000338 ***
tfidf_text_sure       -0.159186   0.074734  -2.130 0.033169 *  
tfidf_text_please      0.161400   0.072772   2.218 0.026563 *  
tfidf_text_phyllis     0.141596   0.095280   1.486 0.137251    
tfidf_text_listen     -0.134684   0.153642  -0.877 0.380700    
tfidf_text_believe     0.194606   0.115799   1.681 0.092850 .  
tfidf_text_told        0.122873   0.104977   1.170 0.241808    
tfidf_text_name        0.186708   0.094160   1.983 0.047382 *  
tfidf_text_whole      -0.130371   0.136622  -0.954 0.339958    
tfidf_text_ever        0.171535   0.102233   1.678 0.093369 .  
tfidf_text_just        0.127548   0.064389   1.981 0.047604 *  
tfidf_text_money      -0.161886   0.134932  -1.200 0.230233    
tfidf_text_boss        0.153626   0.151328   1.015 0.310019    
tfidf_text_ask         0.123247   0.141323   0.872 0.383154    
tfidf_text_feel        0.128474   0.129301   0.994 0.320418    
tfidf_text_find        0.164066   0.101733   1.613 0.106806    
tfidf_text_three      -0.140787   0.108584  -1.297 0.194780    
tfidf_text_need        0.162706   0.083187   1.956 0.050476 .  
tfidf_text_made        0.123781   0.126288   0.980 0.327016    
tfidf_text_long       -0.117803   0.138038  -0.853 0.393434    
tfidf_text_gonna      -0.143375   0.076708  -1.869 0.061610 .  
tfidf_text_hear        0.104088   0.102352   1.017 0.309171    
tfidf_text_friends     0.084551   0.143475   0.589 0.555655    
tfidf_text_wow         0.150225   0.064220   2.339 0.019324 *  
tfidf_text_old         0.106273   0.116189   0.915 0.360374    
tfidf_text_check       0.114413   0.098850   1.157 0.247094    
tfidf_text_wait       -0.126842   0.081461  -1.557 0.119448    
tfidf_text_head        0.126518   0.126729   0.998 0.318118    
tfidf_text_hold        0.141848   0.116016   1.223 0.221459    
tfidf_text_look        0.128788   0.071782   1.794 0.072789 .  
tfidf_text_talk        0.121851   0.087935   1.386 0.165843    
tfidf_text_want        0.116503   0.065304   1.784 0.074422 .  
tfidf_text_company     0.147109   0.130804   1.125 0.260739    
tfidf_text_room        0.107632   0.107204   1.004 0.315382    
tfidf_text_got        -0.122130   0.067140  -1.819 0.068907 .  
tfidf_text_five       -0.096512   0.094074  -1.026 0.304935    
tfidf_text_new        -0.065516   0.123041  -0.532 0.594396    
tfidf_text_mifflin    -0.294953   0.601955  -0.490 0.624139    
tfidf_text_get         0.126518   0.064409   1.964 0.049496 *  
tfidf_text_work        0.110293   0.100668   1.096 0.273247    
tfidf_text_time        0.164705   0.082474   1.997 0.045820 *  
tfidf_text_every       0.140575   0.158257   0.888 0.374395    
tfidf_text_thanks     -0.101296   0.074035  -1.368 0.171245    
tfidf_text_one        -0.079519   0.073817  -1.077 0.281369    
tfidf_text_lot         0.089214   0.105700   0.844 0.398651    
tfidf_text_mr         -0.114169   0.106592  -1.071 0.284131    
tfidf_text_kevin       0.114483   0.083866   1.365 0.172233    
tfidf_text_hello       0.107170   0.066524   1.611 0.107179    
tfidf_text_thought    -0.115780   0.096173  -1.204 0.228640    
tfidf_text_stop        0.112427   0.067616   1.663 0.096364 .  
tfidf_text_things      0.092322   0.123791   0.746 0.455795    
tfidf_text_said        0.111146   0.068770   1.616 0.106052    
tfidf_text_two        -0.064680   0.094019  -0.688 0.491488    
tfidf_text_sorry      -0.101457   0.077747  -1.305 0.191906    
tfidf_text_never       0.101368   0.087415   1.160 0.246206    
tfidf_text_called     -0.091985   0.127479  -0.722 0.470560    
tfidf_text_oh          0.089262   0.047454   1.881 0.059966 .  
tfidf_text_back       -0.078854   0.087660  -0.900 0.368366    
tfidf_text_better     -0.060033   0.108630  -0.553 0.580509    
tfidf_text_jim        -0.095686   0.065332  -1.465 0.143029    
tfidf_text_much        0.076737   0.087948   0.873 0.382923    
tfidf_text_hi         -0.088520   0.073867  -1.198 0.230775    
tfidf_text_guess      -0.107766   0.127102  -0.848 0.396510    
tfidf_text_corporate  -0.050761   0.132030  -0.384 0.700633    
tfidf_text_care       -0.038296   0.138086  -0.277 0.781525    
tfidf_text_day         0.096423   0.104991   0.918 0.358413    
tfidf_text_kind        0.071155   0.102769   0.692 0.488699    
tfidf_text_little      0.124122   0.100206   1.239 0.215468    
tfidf_text_great      -0.073854   0.069929  -1.056 0.290912    
tfidf_text_part        0.030885   0.103786   0.298 0.766023    
tfidf_text_night      -0.086267   0.119819  -0.720 0.471540    
tfidf_text_fine       -0.076178   0.084942  -0.897 0.369812    
tfidf_text_take       -0.055703   0.083438  -0.668 0.504392    
tfidf_text_put        -0.099941   0.093066  -1.074 0.282877    
tfidf_text_saying     -0.071405   0.108057  -0.661 0.508738    
tfidf_text_office      0.066885   0.097381   0.687 0.492187    
tfidf_text_actually    0.069965   0.105800   0.661 0.508421    
tfidf_text_morning     0.136949   0.115285   1.188 0.234866    
tfidf_text_job        -0.020755   0.145005  -0.143 0.886184    
tfidf_text_um          0.049639   0.073370   0.677 0.498685    
tfidf_text_last       -0.050689   0.122785  -0.413 0.679732    
tfidf_text_getting     0.089356   0.104741   0.853 0.393598    
tfidf_text_around      0.082025   0.147866   0.555 0.579081    
tfidf_text_trying      0.073053   0.123007   0.594 0.552583    
tfidf_text_leave      -0.078307   0.105481  -0.742 0.457857    
tfidf_text_whoa        0.081381   0.088434   0.920 0.357445    
tfidf_text_idea       -0.089929   0.091961  -0.978 0.328122    
tfidf_text_nothing     0.063688   0.086331   0.738 0.460687    
tfidf_text_wrong       0.086905   0.092908   0.935 0.349586    
tfidf_text_went       -0.028418   0.125718  -0.226 0.821166    
tfidf_text_help        0.040718   0.100003   0.407 0.683888    
tfidf_text_first       0.086366   0.122036   0.708 0.479127    
tfidf_text_love        0.071734   0.068545   1.047 0.295317    
tfidf_text_us         -0.040849   0.097550  -0.419 0.675399    
tfidf_text_even       -0.038807   0.113579  -0.342 0.732596    
tfidf_text_cool       -0.049036   0.083179  -0.590 0.555514    
tfidf_text_wanna      -0.048928   0.105429  -0.464 0.642585    
tfidf_text_home       -0.048600   0.144852  -0.336 0.737238    
tfidf_text_anything   -0.016118   0.103091  -0.156 0.875756    
tfidf_text_might      -0.046693   0.133043  -0.351 0.725615    
tfidf_text_everything  0.091705   0.119419   0.768 0.442529    
tfidf_text_like        0.051476   0.060225   0.855 0.392698    
tfidf_text_man        -0.048323   0.088421  -0.547 0.584717    
tfidf_text_car        -0.021845   0.099341  -0.220 0.825948    
tfidf_text_now        -0.040680   0.084568  -0.481 0.630495    
tfidf_text_real        0.052797   0.153645   0.344 0.731124    
tfidf_text_paper       0.023861   0.107711   0.222 0.824683    
tfidf_text_still       0.074397   0.094324   0.789 0.430265    
tfidf_text_second     -0.053789   0.125801  -0.428 0.668965    
tfidf_text_done        0.035836   0.087322   0.410 0.681526    
tfidf_text_happy       0.038974   0.088629   0.440 0.660128    
tfidf_text_talking    -0.024875   0.072337  -0.344 0.730941    
tfidf_text_meet        0.064215   0.119743   0.536 0.591768    
tfidf_text_really     -0.024577   0.055026  -0.447 0.655128    
tfidf_text_place       0.038420   0.114313   0.336 0.736798    
tfidf_text_something   0.017752   0.099054   0.179 0.857767    
tfidf_text_call       -0.035360   0.080138  -0.441 0.659038    
tfidf_text_sales      -0.013491   0.094396  -0.143 0.886355    
tfidf_text_thank       0.042518   0.053183   0.799 0.424022    
tfidf_text_hot        -0.003221   0.107750  -0.030 0.976151    
tfidf_text_yeah       -0.023379   0.039966  -0.585 0.558559    
tfidf_text_next        0.009379   0.138893   0.068 0.946165    
tfidf_text_make       -0.009311   0.089531  -0.104 0.917168    
tfidf_text_big         0.046667   0.091438   0.510 0.609791    
tfidf_text_together   -0.021054   0.150126  -0.140 0.888470    
tfidf_text_can         0.048398   0.067522   0.717 0.473510    
tfidf_text_many        0.036402   0.128211   0.284 0.776468    
tfidf_text_years      -0.003345   0.126027  -0.027 0.978824    
tfidf_text_uh          0.033145   0.065356   0.507 0.612051    
tfidf_text_think       0.039276   0.061806   0.635 0.525120    
tfidf_text_ready       0.004309   0.082615   0.052 0.958406    
tfidf_text_manager    -0.005744   0.104443  -0.055 0.956138    
tfidf_text_year       -0.004601   0.122379  -0.038 0.970012    
tfidf_text_let        -0.010076   0.105709  -0.095 0.924062    
tfidf_text_else        0.036421   0.120693   0.302 0.762831    
tfidf_text_way        -0.015943   0.086084  -0.185 0.853072    
tfidf_text_maybe       0.014926   0.078838   0.189 0.849838    
tfidf_text_baby        0.038613   0.100086   0.386 0.699647    
tfidf_text_probably    0.012967   0.141661   0.092 0.927068    
tfidf_text_huh         0.011606   0.088695   0.131 0.895889    
tfidf_text_tell        0.025608   0.073891   0.347 0.728918    
tfidf_text_hey         0.016844   0.043492   0.387 0.698544    
tfidf_text_wanted     -0.003270   0.119500  -0.027 0.978169    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Right, so, the next logical step in my mind is to take a closer look at the random intercepts. We see some variance in the intercept (~.25), which suggests that there are meaningful between-episode differences in how often Michael speaks. Rather than looking at all of these, let's take a look at the 10 largest effects (as a benchmark, recall that the fixed intercept is about -.28).

ranef(glmm_fit) %>%
  as.data.frame() %>%
  select(grp, condval) %>%
  slice_max(order_by = abs(condval), n = 10) %>%
  ggplot(aes(x = abs(condval), y = fct_reorder(grp, abs(condval)), fill = if_else(condval > 0, "Pos", "Neg"))) +
  geom_col() +
  scale_fill_discrete(name = "Sign") +
  labs(
    y = NULL,
    title = "Top Random Intercepts"
  )

This plot shows the largest (in absolute value) random intercepts. The way to interpret this is that, in these episodes, Michael is more or less likely to speak overall. The effects of the words remain the same across episodes (since I didn't specify random slopes), but these intercepts shift the assumed "base rate" at which Michael speaks. What we see here makes sense, because Michael actually isn't in the three episodes that have the highest values here (I should have addressed this in data cleaning – whoops!).
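As a rough way to put that intercept variance (~.25) on a more interpretable scale, the latent-scale intraclass correlation for a logistic model compares the between-episode variance to the total, using pi^2/3 as the within-episode (residual) variance on the logit scale:

# Rough ICC for the random-intercept logistic model
tau2 <- 0.2519
tau2 / (tau2 + pi^2 / 3)
#> [1] 0.0711
# i.e., roughly 7% of the variation in the log-odds of a line being Michael's
# lies between episodes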

Finally, I’ll take a look at the accuracy of the predictions from the multilevel model.

glmm_preds_response <- predict(glmm_fit, te_prepped, type = "response")
glmm_preds <- ifelse(glmm_preds_response < .5, "No", "Yes") %>% as_factor() %>%
  fct_relevel("No", "Yes")
bind_cols(te_prepped$is_mike, glmm_preds) %>%
  repair_names() %>%
  accuracy(truth = ...1, estimate = ...2)
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.596

It's a little bit disappointing that the multilevel model isn't more accurate than the single-level model I ran previously, but one thing to keep in mind is that the single-level model was regularized, whereas the multilevel model wasn't (beyond dropping the variables the LASSO shrank all the way to zero). So, even though the intercept has a decent amount of variance, meaning random effects are probably warranted, the gains in predictive accuracy from those random effects appear to be more than offset by the regularization in the first model. There's probably a way to regularize a multilevel model, but I might save that one for another day. I could also play around with the probability threshold for classifying a line as Michael's by setting it higher than 50% (e.g. requiring a 70% predicted probability before classifying a line as spoken by Michael), but I'm not going to go down that rabbit hole here.
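For reference, that tweak would only involve changing the cutoff when converting predicted probabilities to classes, something like this (a sketch, using 70% instead of 50%):

# Sketch: classify a line as Michael's only when the predicted probability
# exceeds 0.7 rather than 0.5
glmm_preds_strict <- ifelse(glmm_preds_response < .7, "No", "Yes") %>%
  as_factor() %>%
  fct_relevel("No", "Yes")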

So, I’m going to wrap it up for now. And who knows, maybe I’ll revisit this dataset in another 4 months.
