Pulling YouTube Transcripts

Example of pulling transcripts for an entire YouTube playlist.

Eric Ekholm (https://www.ericekholm.com/)
05-15-2020

I’ve been a fan of the Your Mom’s House Podcast for a long time now, and I thought it would be interesting to do some analysis of their speech patterns. If you follow the show at all, you know that the conversations are…special (you can check here for a visualization I did of their word usage over time if you’re so inclined). Fortunately, it’s possible to get transcripts of YouTube videos. Getting transcripts for a single video using the {youtubecaption} R package is fairly straightforward; getting transcripts for a full playlist is a touch more involved, so I wanted to create a quick walkthrough illustrating my process for doing this. Hopefully this will help others who might want to analyze text data from YouTube.

Setup

First, let’s load the packages we need to pull our data. I’m going to use the following:
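library(tidyverse)      # dplyr, purrr, readr, stringr, etc.
library(youtubecaption) # get_caption() for pulling transcripts
library(janitor)        # clean_names()
library(lubridate)      # mdy_hm() for parsing dates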

Getting Transcripts for a Single Video

As I mentioned previously, getting transcripts for a single video is pretty easy thanks to the {youtubecaption} package. All we need is the video’s URL, and the get_caption() function can go do its magic. I’ll illustrate that here using the most recent YMH podcast full episode.

# pull the auto-generated transcript for a single video
ymh_new <- get_caption("https://www.youtube.com/watch?v=VMloBlnczzI")
glimpse(ymh_new)
Rows: 3,157
Columns: 5
$ segment_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...
$ text       <chr> "this episode of your mom's house is", "brough...
$ start      <dbl> 0.000, 1.140, 3.659, 7.859, 8.910, 14.820, 20....
$ duration   <dbl> 3.659, 6.719, 5.251, 6.961, 11.879, 9.080, 3.1...
$ vid        <chr> "VMloBlnczzI", "VMloBlnczzI", "VMloBlnczzI", "...

We can see above that this gives us a tibble with the text (auto-transcribed by YouTube) broken into short segments, along with identifying information for each segment.

One thing worth mentioning here is that the transcripts are automatically transcribed by a speech-to-text model. It seems really good, but it will make some mistakes, particularly around brand names and website addresses (in my limited experience).

Getting Transcripts for Several Videos

But what if we want to get transcripts for several videos? The get_caption() function requires the URL of each video that we want to get a caption for. If you want to analyze transcripts from more than a handful of videos, it would get really tedious really quickly to go and grab the individual URLs. And, more specifically, what if you wanted to get the transcripts for all videos from a single playlist?

Get URLs

I found this tool that will take a YouTube playlist ID and provide an Excel file with, among other information, the URL for each video in the playlist, which is exactly what we need for the get_caption() function.

I used the tool on 5/14/20 to get a file with the data for all of the videos in the YMH Podcast - Full Episodes playlist. I’ll go ahead and load in the file, plus do some light cleaning, in the code below.

ep_links <- read_csv("~/Data/YMH/Data/ymh_full_ep_links.csv") %>%
  clean_names() %>%
  mutate(
    # extract the 3-digit episode number from the video title
    ep_num = str_replace_all(title, ".*Ep.*(\\d{3}).*", "\\1") %>%
      as.double(),
    # manually fix one episode whose title parses to the wrong number
    ep_num = if_else(ep_num == 19, 532, ep_num),
    # parse the published date string into a datetime
    published_date = mdy_hm(published_date),
    # pull the video ID out of the URL (everything after the "=")
    vid = str_replace_all(video_url, ".*=(.*)$", "\\1")
  )
glimpse(ep_links)
Rows: 223
Columns: 7
$ published_date <dttm> 2020-04-29 12:03:00, 2020-04-22 12:00:00,...
$ video_url      <chr> "https://www.youtube.com/watch?v=xw3KNj2yw...
$ channel        <chr> "YourMomsHousePodcast", "YourMomsHousePodc...
$ title          <chr> "Your Mom's House Podcast - Ep. 549", "You...
$ description    <chr> "Want an ad-free experience? Click here to...
$ ep_num         <dbl> 549, 548, 547, 546, 545, 544, 543, NA, 542...
$ vid            <chr> "xw3KNj2ywVo", "_BVQvqPvu-8", "HvueqYO--tc...

We can see that this gives us the URLs for all 223 videos in the playlist.

The cleaning steps for the published_date variable and the vid variable should be pretty universal. The step that gets the episode number extracts it from the video’s title, so it’s specific to the playlist I’m using.
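To see what that title regex is doing, here it is applied to one of the titles from the playlist (just an illustration, not part of the pipeline):

str_replace_all("Your Mom's House Podcast - Ep. 549", ".*Ep.*(\\d{3}).*", "\\1")
[1] "549"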

“Safely” Pull Transcripts

Now that we have all of the URLs, we can iterate through them with the get_caption() function. Before we do that, though, we want to make get_caption() robust to failure: we don’t want the whole series of iterations to die because one call returns an error. In other words, we want the function to get all of the transcripts it can, tell us which ones it can’t, and keep going either way.

To do this, we just wrap the get_caption() function in the safely() function from {purrr}.

safe_cap <- safely(get_caption)

You can read more about safely() in the {purrr} documentation, but it basically returns, for each call, a two-element list: one element with the “result” of the function and another with the “error.” If the function succeeds, “error” will be NULL and “result” will hold the output; if the function fails, “result” will be NULL and “error” will show the error message.
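To make this concrete, here’s a sketch of what a failing call looks like (the video ID below is made up):

bad <- safe_cap("https://www.youtube.com/watch?v=xxxxxxxxxxx") # hypothetical ID
bad$result # NULL, since the caption pull failed
bad$error  # the error condition explaining why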

Now that we have our safe_cap() function, we can use map() from {purrr} to pull transcripts from all of the videos we have URLs for.

# pull transcripts for every video URL, capturing errors instead of failing
ymh_trans <- map(ep_links$video_url,
                 safe_cap)
glimpse(head(ymh_trans))
List of 6
 $ :List of 2
  ..$ result: tibble [2,663 x 5] (S3: tbl_df/tbl/data.frame)
  ..$ error : NULL
 $ :List of 2
  ..$ result: tibble [3,093 x 5] (S3: tbl_df/tbl/data.frame)
  ..$ error : NULL
 $ :List of 2
  ..$ result: tibble [3,727 x 5] (S3: tbl_df/tbl/data.frame)
  ..$ error : NULL
 $ :List of 2
  ..$ result: tibble [2,701 x 5] (S3: tbl_df/tbl/data.frame)
  ..$ error : NULL
 $ :List of 2
  ..$ result: tibble [3,276 x 5] (S3: tbl_df/tbl/data.frame)
  ..$ error : NULL
 $ :List of 2
  ..$ result: tibble [3,382 x 5] (S3: tbl_df/tbl/data.frame)
  ..$ error : NULL

Format Data

This returns a list the same length as our vector of URLs (223 in this case) in the format described above. We want to get the “result” element from each of these lists. (You might also be interested in looking at the errors, but they’re all going to be the same here: a transcript simply isn’t available for a given video.) To do that, we iterate over all elements of our transcript list (using map() again) and use the pluck() function from {purrr} to grab the “result” object. We then use the compact() function to drop any NULL elements from the list (remember that “result” will be NULL if the function couldn’t get a transcript for that video). This leaves us with a list of all the transcripts the function successfully fetched.
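If you do want to check which videos failed, a quick sketch using map_lgl() from {purrr} would look something like this:

failed <- map_lgl(ymh_trans, ~ is.null(.x$result)) # TRUE where no transcript came back
ep_links$title[failed] # titles of the videos we couldn't pull transcripts for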

Next, we use the bind_rows() function to turn this list into a single tibble. And finally, we can inner_join() this with the tibble that has our URLs, so that the metadata for each video and its transcript live in the same tibble.

res <- map(ymh_trans, ~ pluck(.x, "result")) %>% # grab the "result" element of each list
  compact() %>% # drop the NULLs (videos without transcripts)
  bind_rows() %>% # stack the transcript tibbles into one
  inner_join(x = ep_links, # join the video metadata onto the transcripts
             y = .,
             by = "vid")
glimpse(res)
Rows: 437,098
Columns: 11
$ published_date <dttm> 2020-04-29 12:03:00, 2020-04-29 12:03:00,...
$ video_url      <chr> "https://www.youtube.com/watch?v=xw3KNj2yw...
$ channel        <chr> "YourMomsHousePodcast", "YourMomsHousePodc...
$ title          <chr> "Your Mom's House Podcast - Ep. 549", "You...
$ description    <chr> "Want an ad-free experience? Click here to...
$ ep_num         <dbl> 549, 549, 549, 549, 549, 549, 549, 549, 54...
$ vid            <chr> "xw3KNj2ywVo", "xw3KNj2ywVo", "xw3KNj2ywVo...
$ segment_id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
$ text           <chr> "oh snap there's hot gear merge method", "...
$ start          <dbl> 0.030, 4.020, 6.629, 12.450, 14.730, 17.40...
$ duration       <dbl> 6.599, 8.430, 8.101, 4.950, 4.530, 5.600, ...

Hopefully this helps folks & best of luck with your text analyses!
