Stranger Strings

Tokenizing and counting Stranger Things dialogue using Julia

Categories: Julia, Text Analysis, TidyTuesday
Published: October 26, 2022

In my quest to continue learning how to do things in Julia, I wanted to play around with last week’s #TidyTuesday dataset, which was the dialogue from every episode of Stranger Things. In my data-analysis dabbling in Julia so far, I’ve more or less avoided strings. This has mostly been because I’ve been focusing on numerical topics (like maximum likelihood estimation), but also because working with strings can be a pain. That said, it felt like time to explore strings in Julia, and this dataset provided a good opportunity to practice.

The goal of this analysis is fairly straightforward – I’m going to count the most frequently used words in the series. But this will require learning some fundamental tools like tokenizing, pivoting/reshaping data, and cleaning text data, among others.

As always, the point of this is to work through my own learning process. I’m certainly not claiming to be an expert, and if you are an expert and can recommend better approaches, I’d love to hear them!

So let’s get to it.

Setup and Examine Data

First, let’s load the packages we’ll use and read the data in:

using CSV #for reading CSVs
using DataFrames #dataframe utilities
using Chain #chain macro, similar to R's pipe
using Languages #for stopwords
using CairoMakie #plotting
using Statistics #for median

st_things_dialogue = CSV.read(download("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-10-18/stranger_things_all_dialogue.csv"), DataFrame);

And then we can look at the size of the dataframe:

size(st_things_dialogue)
(32519, 8)

As well as see the first few rows:

first(st_things_dialogue, 3)

3 rows × 8 columns (omitted printing of 1 columns)

 Row │ season  episode  line  raw_text             stage_direction      dialogue  start_time
     │ Int64   Int64    Int64 String               String?              String?   Time
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ 1       1        1     [crickets chirping]  [crickets chirping]  missing   00:00:07
   2 │ 1       1        2     [alarm blaring]      [alarm blaring]      missing   00:00:49
   3 │ 1       1        3     [panting]            [panting]            missing   00:00:52

So we can see that dialogue might be missing if the line is just stage directions. For our purposes here, let’s just use the lines with dialogue. To do this, we can use the dropmissing() function, passing in the DataFrame and the column we want to keep only complete cases of, which is :dialogue in this case. Note that Julia uses : to denote symbols.

dialogue_complete = dropmissing(st_things_dialogue, :dialogue)

26,435 rows × 8 columns (omitted printing of 4 columns)

 Row │ season  episode  line  raw_text
     │ Int64   Int64    Int64 String
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ 1       1        9     [Mike] Something is coming. Something hungry for blood.
   2 │ 1       1        10    A shadow grows on the wall behind you, swallowing you in darkness.
   3 │ 1       1        11    -It is almost here. -What is it?
   4 │ 1       1        12    What if it's the Demogorgon?
   5 │ 1       1        13    Oh, Jesus, we're so screwed if it's the Demogorgon.
   6 │ 1       1        14    It's not the Demogorgon.
   7 │ 1       1        15    An army of troglodytes charge into the chamber!
   8 │ 1       1        16    -Troglodytes? -Told ya. [chuckling]
   9 │ 1       1        17    -[snorts] -[all chuckling]
  10 │ 1       1        18    [softly] Wait a minute.
  11 │ 1       1        19    Did you hear that?
  12 │ 1       1        20    That... that sound?
  13 │ 1       1        21    Boom... boom...
  14 │ 1       1        22    -[yells] Boom! -[slams table]
  15 │ 1       1        23    That didn't come from the troglodytes. No, that...
  16 │ 1       1        24    That came from something else.
  17 │ 1       1        25    -The Demogorgon! -[all groaning]
  18 │ 1       1        26    -We're in deep shit. -Will, your action!
  19 │ 1       1        27    -I don't know! -Fireball him!
  20 │ 1       1        28    I'd have to roll a 13 or higher!
  21 │ 1       1        29    Too risky. Cast a protection spell.
  22 │ 1       1        30    -Don't be a pussy. Fireball him! -Cast Protection.
  23 │ 1       1        31    The Demogorgon is tired of your silly human bickering!
  24 │ 1       1        32    It stomps towards you.
  25 │ 1       1        33    -Boom! -Fireball him!
  26 │ 1       1        34    -Another stomp, boom! -Cast Protection.
  27 │ 1       1        35    -He roars in anger! -[all clamoring]
  28 │ 1       1        36    -Fireball! -[die clattering]
  29 │ 1       1        37    -Oh, shit! -[Lucas] Where'd it go?
  30 │ 1       1        38    [Lucas] Where is it? [Will] I don't know!

Reshape Data

Cool, so this will get us just rows that actually have dialogue. But what we can see is that each row is a line of dialogue, whereas we actually want to tokenize this so that each row is a word.

To do this, we can use the split function, which lets us split a string at whatever delimiter we provide. In this case, that’s a space. For example:

split("a man a plan a canal panama", " ")
7-element Vector{SubString{String}}:
 "a"
 "man"
 "a"
 "plan"
 "a"
 "canal"
 "panama"

Or, using our actual data:

split(dialogue_complete.dialogue[1], " ")
7-element Vector{SubString{String}}:
 "Something"
 "is"
 "coming."
 "Something"
 "hungry"
 "for"
 "blood."

It’s worth noting that split() splits on whitespace by default, so we can also call the function without the delimiter argument:

split(dialogue_complete.dialogue[1])
7-element Vector{SubString{String}}:
 "Something"
 "is"
 "coming."
 "Something"
 "hungry"
 "for"
 "blood."

So this gives us the first step of what we want to do in tokenizing the dialogue.

Let’s start putting this into a chain, which is similar to R’s pipe concept. And apparently there are several different chains/pipes in Julia, but the Chain.jl package seems reasonable to me so let’s just use that one.

We can begin a chain operation with the @chain macro, then pass the dataframe name and a begin keyword. We then do all of our operations, then pass the end keyword. Like tidyverse functions in R, most of Julia’s DataFrame functions expect a dataframe as the first argument, which makes them work well with chains.

df_split = @chain dialogue_complete begin
    select(
        :season,
        :episode,
        :line,
        :dialogue => ByRow(split) => :dialogue_split
    )
end

26,435 rows × 4 columns

 Row │ season  episode  line  dialogue_split
     │ Int64   Int64    Int64 Array…
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ 1       1        9     ["Something", "is", "coming.", "Something", "hungry", "for", "blood."]
   2 │ 1       1        10    ["A", "shadow", "grows", "on", "the", "wall", "behind", "you,", "swallowing", "you", "in", "darkness."]
   3 │ 1       1        11    ["It", "is", "almost", "here.", "What", "is", "it?"]
   4 │ 1       1        12    ["What", "if", "it's", "the", "Demogorgon?"]
   5 │ 1       1        13    ["Oh,", "Jesus,", "we're", "so", "screwed", "if", "it's", "the", "Demogorgon."]
   6 │ 1       1        14    ["It's", "not", "the", "Demogorgon."]
   7 │ 1       1        15    ["An", "army", "of", "troglodytes", "charge", "into", "the", "chamber!"]
   8 │ 1       1        16    ["Troglodytes?", "Told", "ya."]
   9 │ 1       1        17    []
  10 │ 1       1        18    ["Wait", "a", "minute."]
  11 │ 1       1        19    ["Did", "you", "hear", "that?"]
  12 │ 1       1        20    ["That...", "that", "sound?"]
  13 │ 1       1        21    ["Boom...", "boom..."]
  14 │ 1       1        22    ["Boom!"]
  15 │ 1       1        23    ["That", "didn't", "come", "from", "the", "troglodytes.", "No,", "that..."]
  16 │ 1       1        24    ["That", "came", "from", "something", "else."]
  17 │ 1       1        25    ["The", "Demogorgon!"]
  18 │ 1       1        26    ["We're", "in", "deep", "shit.", "Will,", "your", "action!"]
  19 │ 1       1        27    ["I", "don't", "know!", "Fireball", "him!"]
  20 │ 1       1        28    ["I'd", "have", "to", "roll", "a", "13", "or", "higher!"]
  21 │ 1       1        29    ["Too", "risky.", "Cast", "a", "protection", "spell."]
  22 │ 1       1        30    ["Don't", "be", "a", "pussy.", "Fireball", "him!", "Cast", "Protection."]
  23 │ 1       1        31    ["The", "Demogorgon", "is", "tired", "of", "your", "silly", "human", "bickering!"]
  24 │ 1       1        32    ["It", "stomps", "towards", "you."]
  25 │ 1       1        33    ["Boom!", "Fireball", "him!"]
  26 │ 1       1        34    ["Another", "stomp,", "boom!", "Cast", "Protection."]
  27 │ 1       1        35    ["He", "roars", "in", "anger!"]
  28 │ 1       1        36    ["Fireball!"]
  29 │ 1       1        37    ["Oh,", "shit!", "Where'd", "it", "go?"]
  30 │ 1       1        38    ["Where", "is", "it?", "I", "don't", "know!"]

Technically we don’t need to chain anything above since we’re just doing one operation (select()) right now, but we’ll add more soon.

One thing you might notice in the final line within select() is Julia’s notation for “doing things”: input_col => function => output_col. In the case above, the function is the named function split, wrapped in a special ByRow() function that facilitates broadcasting in dataframe operations, i.e. applying the function to each row of the column. We could equally supply an anonymous function here using the x -> fun(x, ...) syntax.
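To make that notation concrete, here’s a tiny sketch on a made-up dataframe (the column names here are mine, not from the dataset), showing the named-function and anonymous-function versions side by side:

```julia
using DataFrames

# toy dataframe to illustrate the input_col => function => output_col mini-language
df = DataFrame(word = ["Boom!", "Fireball!"])

# named function wrapped in ByRow(): lowercase is applied to each row
select(df, :word => ByRow(lowercase) => :word_lower)

# the same operation with an anonymous function
select(df, :word => ByRow(x -> lowercase(x)) => :word_lower)
```

Both produce a :word_lower column containing ["boom!", "fireball!"].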

All that said, the above doesn’t quite give us what we want if we look at the first two rows of output:

first(df_split, 2)

2 rows × 4 columns

 Row │ season  episode  line  dialogue_split
     │ Int64   Int64    Int64 Array…
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ 1       1        9     ["Something", "is", "coming.", "Something", "hungry", "for", "blood."]
   2 │ 1       1        10    ["A", "shadow", "grows", "on", "the", "wall", "behind", "you,", "swallowing", "you", "in", "darkness."]

Our dialogue_split column is a vector of vectors. To get around this, we want to flatten the column so that each row contains a single word. The nice thing about our chain operation above is that we can just plunk the flatten() function right on the end to do this:

df_split = @chain dialogue_complete begin
    select(
        :season,
        :episode,
        :line,
        :dialogue => ByRow(split) => :dialogue_split
    )
    flatten(:dialogue_split)
end

145,243 rows × 4 columns

 Row │ season  episode  line  dialogue_split
     │ Int64   Int64    Int64 SubStrin…
─────┼────────────────────────────────────────
   1 │ 1       1        9     Something
   2 │ 1       1        9     is
   3 │ 1       1        9     coming.
   4 │ 1       1        9     Something
   5 │ 1       1        9     hungry
   6 │ 1       1        9     for
   7 │ 1       1        9     blood.
   8 │ 1       1        10    A
   9 │ 1       1        10    shadow
  10 │ 1       1        10    grows
  11 │ 1       1        10    on
  12 │ 1       1        10    the
  13 │ 1       1        10    wall
  14 │ 1       1        10    behind
  15 │ 1       1        10    you,
  16 │ 1       1        10    swallowing
  17 │ 1       1        10    you
  18 │ 1       1        10    in
  19 │ 1       1        10    darkness.
  20 │ 1       1        11    It
  21 │ 1       1        11    is
  22 │ 1       1        11    almost
  23 │ 1       1        11    here.
  24 │ 1       1        11    What
  25 │ 1       1        11    is
  26 │ 1       1        11    it?
  27 │ 1       1        12    What
  28 │ 1       1        12    if
  29 │ 1       1        12    it's
  30 │ 1       1        12    the

Better! Now let’s check out the first 10 elements of our dialogue split column:

show(first(df_split.:dialogue_split, 10))
SubString{String}["Something", "is", "coming.", "Something", "hungry", "for", "blood.", "A", "shadow", "grows"]

Clean Text

So, it’s not ideal that we have punctuation in here. We don’t want, for instance, “blood” to be considered a different word than “blood.” when we count words later. Same deal for uppercase and lowercase letters – we want “something” to be the same as “Something”. So we need to strip punctuation and lowercase everything.

First, we can write a small helper function to strip punctuation.

function strip_punc(x)
    strip(x, [',', ';', '.', '?', '!'])
end
strip_punc (generic function with 1 method)
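As a quick sanity check, here’s the helper applied to a few tokens from the data (repeated here so the snippet stands alone):

```julia
# same helper as above, repeated so this snippet is self-contained
function strip_punc(x)
    strip(x, [',', ';', '.', '?', '!'])
end

# quick sanity checks on tokens we've seen in the data
strip_punc("coming.")  # "coming"
strip_punc("it?")      # "it"
strip_punc("boom...")  # "boom"
```

Note that strip() only trims leading and trailing characters, so interior punctuation like the apostrophe in “it’s” survives, which is what we want.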

And Julia already has a lowercase() function built in. Now, let’s jam these on the end of the chain we already have:

df_split = @chain dialogue_complete begin
    select(
        :season,
        :episode,
        :line,
        :dialogue => ByRow(split) => :dialogue_split
    )
    flatten(:dialogue_split)
    transform(:dialogue_split => ByRow(lowercase) => :dialogue_split)
    transform(:dialogue_split => ByRow(strip_punc) => :dialogue_stripped)
end

145,243 rows × 5 columns

 Row │ season  episode  line  dialogue_split  dialogue_stripped
     │ Int64   Int64    Int64 String          SubStrin…
─────┼──────────────────────────────────────────────────────────
   1 │ 1       1        9     something       something
   2 │ 1       1        9     is              is
   3 │ 1       1        9     coming.         coming
   4 │ 1       1        9     something       something
   5 │ 1       1        9     hungry          hungry
   6 │ 1       1        9     for             for
   7 │ 1       1        9     blood.          blood
   8 │ 1       1        10    a               a
   9 │ 1       1        10    shadow          shadow
  10 │ 1       1        10    grows           grows
  11 │ 1       1        10    on              on
  12 │ 1       1        10    the             the
  13 │ 1       1        10    wall            wall
  14 │ 1       1        10    behind          behind
  15 │ 1       1        10    you,            you
  16 │ 1       1        10    swallowing      swallowing
  17 │ 1       1        10    you             you
  18 │ 1       1        10    in              in
  19 │ 1       1        10    darkness.       darkness
  20 │ 1       1        11    it              it
  21 │ 1       1        11    is              is
  22 │ 1       1        11    almost          almost
  23 │ 1       1        11    here.           here
  24 │ 1       1        11    what            what
  25 │ 1       1        11    is              is
  26 │ 1       1        11    it?             it
  27 │ 1       1        12    what            what
  28 │ 1       1        12    if              if
  29 │ 1       1        12    it's            it's
  30 │ 1       1        12    the             the

Confirming that this worked:

show(df_split.:dialogue_stripped[1:10])
SubString{String}["something", "is", "coming", "something", "hungry", "for", "blood", "a", "shadow", "grows"]

Splendid.

Remove Stop Words

The next step is to get rid of stop words, because we don’t really care about counting those. There’s a list of stopwords in the Languages.jl package that we’ll use:

stops = stopwords(Languages.English())
488-element Vector{String}:
 "a"
 "about"
 "above"
 "across"
 "after"
 "again"
 "against"
 "all"
 "almost"
 "alone"
 "along"
 "already"
 "also"
 ⋮
 "you'd"
 "you'll"
 "young"
 "younger"
 "youngest"
 "your"
 "you're"
 "yours"
 "yourself"
 "yourselves"
 "you've"
 "z"

Swell. Now that we have this, we can subset (filter in R terms) our dataset to include only rows with words not in the list of stop words.

dialogue_no_stops = subset(
    df_split,
    :dialogue_stripped => x -> .!in.(x, Ref(stops))
    )

50,812 rows × 5 columns

 Row │ season  episode  line  dialogue_split  dialogue_stripped
     │ Int64   Int64    Int64 String          SubStrin…
─────┼──────────────────────────────────────────────────────────
   1 │ 1       1        9     coming.         coming
   2 │ 1       1        9     hungry          hungry
   3 │ 1       1        9     blood.          blood
   4 │ 1       1        10    shadow          shadow
   5 │ 1       1        10    grows           grows
   6 │ 1       1        10    wall            wall
   7 │ 1       1        10    swallowing      swallowing
   8 │ 1       1        10    darkness.       darkness
   9 │ 1       1        12    demogorgon?     demogorgon
  10 │ 1       1        13    oh,             oh
  11 │ 1       1        13    jesus,          jesus
  12 │ 1       1        13    screwed         screwed
  13 │ 1       1        13    demogorgon.     demogorgon
  14 │ 1       1        14    demogorgon.     demogorgon
  15 │ 1       1        15    army            army
  16 │ 1       1        15    troglodytes     troglodytes
  17 │ 1       1        15    charge          charge
  18 │ 1       1        15    chamber!        chamber
  19 │ 1       1        16    troglodytes?    troglodytes
  20 │ 1       1        16    told            told
  21 │ 1       1        16    ya.             ya
  22 │ 1       1        18    wait            wait
  23 │ 1       1        18    minute.         minute
  24 │ 1       1        19    hear            hear
  25 │ 1       1        20    sound?          sound
  26 │ 1       1        21    boom...         boom
  27 │ 1       1        21    boom...         boom
  28 │ 1       1        22    boom!           boom
  29 │ 1       1        23    troglodytes.    troglodytes
  30 │ 1       1        24    else.           else

If you’re not familiar with Julia, the . is a way to broadcast/vectorize operations, which mostly aren’t vectorized by default. The Ref() wrapper is there to shield our vector of stopwords from that broadcasting: it tells Julia to treat stops as a single (scalar) argument, so each word gets checked against the whole stopword list rather than being paired up with it elementwise. And this does what we want!
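Here’s a minimal sketch of the Ref() behavior on made-up toy vectors (the stop list and words below are mine, for illustration):

```julia
# toy stop list and words, made up for illustration
stops_demo = ["a", "the", "is"]
words = ["coming", "is", "blood"]

# Ref() shields the stop list from broadcasting: each word is checked
# against the whole vector instead of being paired with it elementwise
in.(words, Ref(stops_demo))           # [false, true, false]

# .! negates elementwise, leaving only the non-stop words
words[.!in.(words, Ref(stops_demo))]  # ["coming", "blood"]
```

This is the same pattern the subset() call above uses on the :dialogue_stripped column.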

Getting the Top 20 Words

We’re almost there, fam. We’ve got a dataset in the format we want it in, and we’ve done some light cleaning. Now, let’s count how often each word is used and select the top 20 most common. Again, we’re going to chain some operations together.

top_20 = @chain dialogue_no_stops begin
    groupby(:dialogue_stripped)
    combine(nrow => :count)
    sort(:count, rev = true)
    first(20)
end

20 rows × 2 columns

 Row │ dialogue_stripped  count
     │ SubStrin…          Int64
─────┼─────────────────────────
   1 │                    1386
   2 │ yeah               1106
   3 │ okay               960
   4 │ oh                 670
   5 │ hey                631
   6 │ shit               456
   7 │ gonna              427
   8 │ uh                 396
   9 │ mean               310
  10 │ time               284
  11 │ sorry              281
  12 │ look               242
  13 │ tell               240
  14 │ mike               234
  15 │ stop               227
  16 │ maybe              225
  17 │ please             224
  18 │ max                213
  19 │ god                211
  20 │ little             211

I’m actually not going to explain the above because I think it’s pretty intuitive if you’ve been following along so far and are familiar with either R or Python functions (the function names here are pretty descriptive, I think). One quirk worth flagging: the most frequent “word” is the empty string, which is presumably what punctuation-only tokens (like “...”) become after stripping.
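If the names aren’t familiar, here’s the same count-sort-take-top pattern on a toy dataframe (made up here, not from the dataset):

```julia
using DataFrames

# toy data to illustrate groupby -> combine -> sort -> first
toy = DataFrame(word = ["boom", "boom", "fireball", "boom", "cast"])

counts = combine(groupby(toy, :word), nrow => :count)  # count rows per word
sort!(counts, :count, rev = true)                      # most frequent first
first(counts, 1)                                       # the top word ("boom", 3)
```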

Plotting

Ok, so, as much as I like Julia so far, plotting does feel difficult. I’ve mostly used Makie and its counterparts, and I think I’m almost starting to get a handle on them, but they definitely don’t feel as intuitive to me as, say, ggplot2.

Full transparency – making this little plot took me more time than I wanted it to, and it’s entirely due to labeling the y-axis ticks. So, uh, here’s the code to make the plot, and just know that I don’t fully understand why some options accept vectors while others want tuples.

lbls = "Rank " .* reverse(string.(1:20))

barplot(
    1:nrow(top_20),
    reverse(top_20.count),
    direction = :x,
    bar_labels = reverse(top_20.:dialogue_stripped),
    flip_labels_at = median(top_20.count),
    axis = (
        yticks = (1:20, lbls),
        title = "Most Common Words in Stranger Things",
        xlabel = "Times Said"
    ),
)

Et voilà – we’ve taken a dataframe with dialogue, tokenized it, cleaned it a little bit, and found the top 20 most common words. We could modify our list of stop words a little if we wanted to get rid of things like “oh”, “okay”, “uh”, and whatnot, but I’m not going to bother with that here. I hope you learned as much from reading this as I did from writing it!
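If you did want to drop those filler words, one sketch of the approach (using a toy stop list and dataframe so the snippet stands alone – in the post you’d extend stops with vcat and re-run the subset on df_split) might look like:

```julia
using DataFrames

# a few filler words to add to the stop list (examples only); the empty
# string catches punctuation-only tokens that strip down to nothing
custom_stops = ["oh", "okay", "uh", "yeah", ""]

# toy stand-in for the real dialogue data
df = DataFrame(dialogue_stripped = ["oh", "demogorgon", "", "yeah", "boom"])

# same subset pattern as before, against the extended stop list
subset(df, :dialogue_stripped => x -> .!in.(x, Ref(custom_stops)))
# keeps only "demogorgon" and "boom"
```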


Citation

BibTeX citation:
@online{ekholm2022,
  author = {Eric Ekholm},
  title = {Stranger {Strings}},
  date = {2022-10-26},
  url = {https://www.ericekholm.com/posts/stranger-strings},
  langid = {en}
}
For attribution, please cite this work as:
Eric Ekholm. 2022. “Stranger Strings.” October 26, 2022. https://www.ericekholm.com/posts/stranger-strings.