Getting into scrapes

The answer, my friend, is blowin’ in the wind. The answer is blowin’ in the wind
– Bob Dylan

Let’s get rvested: using webscraping to pull all the data from all the sites
webscraping
API
music
sentiments
ggplot2
rvest
Published: June 6, 2023

This post uses the rvest package to scrape the web for data on movies, Bob Dylan song lyrics, and untranslatable words.


IMDbest

As usual, we’ll start by replicating the R4Dev session, beginning by scraping the IMDb Top 250 chart: the site’s 250 highest-rated movies, as voted by users.

Check out my code
library(tidyverse)
library(rvest)
library(RColorBrewer)

# Save the URL of the page we want ####
imdb_top_url <- "https://www.imdb.com/chart/top/"

# Scrape ! ####
imdb_top_250_raw <- imdb_top_url %>%
  # Read the page
  rvest::read_html() %>%
  # Pull chosen objects
  rvest::html_elements(".titleColumn a, .secondaryInfo, strong") %>%
  # Turn into text
  rvest::html_text() %>%
  # Turn into a tibble
  as_tibble()

imdb_top_250_raw %>%
  # Let's create a variable that identifies each row's type: title, year or score
  dplyr::mutate(category = rep(c("title","year","score"),times=n()/3)) %>%
  # Pivot the data wider. With no id column, every title/year/score collapses into a list column (hence values_fn = list), so we unnest everything right after
  tidyr::pivot_wider(names_from = category, values_from = value, values_fn = list) %>%
  tidyr::unnest(cols = everything()) %>%
  # Convert the score to a numeric rating, keep the top 15, and plot
  dplyr::mutate(rating = as.numeric(score)) %>%
  dplyr::slice_max(n=15,order_by = rating) %>%
  # plot
  ggplot(aes(x=title %>% reorder(rating),y=rating))+
  geom_point(color="gold2",size=8)+
  geom_text(aes(label=rating),size=3,color="black")+
  geom_text(aes(label=year,y=8.5),size=3,color="white",hjust=2)+
  coord_flip()+
  ylim(c(8,10))+
  labs(x=NULL,y="IMDb rating",
       title = "IMDb top 15 movies",
       subtitle="Popular != Good")+
  theme_dark()+
  theme(axis.text.y = element_text(size = 12))
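
A quick aside before moving on: the rep(c("title","year","score")) trick above only works if the scrape returns exactly three elements per film, in order. IMDb tweaks its markup from time to time, so a cheap sanity check before the pivot can save some head-scratching. A minimal sketch (my addition, not part of the original session):

# Sanity check: the pivot assumes strict title/year/score order, three
# elements per film, so the row count should be a multiple of three
# (750 for the full Top 250)
nrow(imdb_top_250_raw) %% 3 == 0
nrow(imdb_top_250_raw) / 3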

Live by no man’s code

Using the webscraping tools we practiced on IMDb, along with some tools from previous sessions, let’s do a sentiment analysis of Bob Dylan’s lyrics.

Bob Dylan has a catalogue of over 700 songs, the lyrics to which can be found on the website https://www.bobdylan.com/songs/. However, the website seems a bit incomplete, and only around 500 of these actually have lyrics on the site, so we’ll have to do a bit of filtering later to get the right set of words.

Check out my code
library(tidyverse)
library(rvest)
library(tidytext)

# Extract all song lyrics URLs ####
dylan_urls <- "https://www.bobdylan.com/songs" |>
  # First step as before
  rvest::read_html() |>
  # We select the element we want, which is the link to every song
  rvest::html_elements(".song a") |>
  # We extract the link to each song by picking an attribute, which for links is href
  rvest::html_attr("href") |>
  tibble::as_tibble() |>
  # We keep only one variable which is the URL of each song
  dplyr::transmute(url = value)

# Create an empty list and save all song lyrics using a loop ####
dylan_lyrics <- list()
for(i in dylan_urls$url) {

  dylan_lyrics[[i]] <- rvest::read_html(i) |>
    # Extract what we want
    rvest::html_elements(".lyrics") |>
    # Make it into text
    rvest::html_text() |>
    as_tibble() 
    
}
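
One caveat on the loop above: it fires roughly 700 requests in quick succession. If you rerun it, a short pause between requests is kinder to the server. Here is the same loop with a half-second sleep added (the 0.5 is an arbitrary choice of mine, not something the original run used):

for(i in dylan_urls$url) {
  # Be polite: pause briefly between requests
  Sys.sleep(0.5)
  dylan_lyrics[[i]] <- rvest::read_html(i) |>
    rvest::html_elements(".lyrics") |>
    rvest::html_text() |>
    as_tibble()
}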

Now that the lyrics are all scraped into a list of tibbles, we can tag each element with its song URL, bind them together into one data frame, and drop the pages that have no lyrics, resulting in a dataset where each song’s lyrics sit in their own row. Then we can proceed to the sentiment analysis process we used on books a few sessions ago: tokenize by words, remove stopwords, and join our dataset to a sentiment lexicon (in this case we use AFINN). Then we feed the data into a plot to visualize the sentiment distribution of Dylan’s lyrics.

Check out my code
library(tidyverse)
library(tidytext)
library(textdata)

lyrics_clean1 <- dylan_lyrics |>
  # Identify each list with the url using imap
  purrr::imap(~mutate(.x, url=.y)) |>
  # Bind all lyrics into one data frame
  purrr::map_df(bind_rows) |>
  # Drop pages that just invite you to click through to live performances instead of lyrics
  dplyr::filter(!str_detect(value,"Click")) |>
  # Unnest words
  tidytext::unnest_tokens(input = "value", output = "text", token = "words") |>
  # Remove stopwords
  dplyr::anti_join(stopwords::stopwords("en") |> as_tibble(), by=c("text"="value")) |>
  # Join sentiment lexicon
  dplyr::inner_join(tidytext::get_sentiments("afinn"), by=c("text"="word")) |>
  # Calculate most frequent word by sentiment lexicon
  dplyr::group_by(text) |>
  dplyr::mutate(word_freq = n()) |>
  dplyr::ungroup() |>
  # Count the total number of scored words so we can compute each level's share
  dplyr::mutate(total = n()) |>
  # Add per-sentiment-level counts ####
  dplyr::group_by(value) |>
  dplyr::mutate(n = n(),
                p = n/total,
                c = case_when(value<0 ~ "Negative",
                              value==0 ~ "Neutral",
                              value>0 ~ "Positive") |>
                  factor())

# Visualise with ggplot
lyrics_clean1 |> 
  ggplot(aes(x=value,y=n,fill=c, color=c))+
  geom_col()+
  # Notice how the data is filtered inside of the geom ####
  geom_text(data = . %>%
              dplyr::group_by(value) %>%
              dplyr::slice_max(n=1, order_by = word_freq) %>%
              dplyr::distinct(text, .keep_all = TRUE),
             aes(y=4300000, label=text), 
            angle=50, size=5)+
  geom_vline(xintercept = 0)+
  scale_x_continuous(breaks=seq(from=-5,to=5,by=1))+
  scale_y_continuous()+
  scale_fill_brewer(palette = "Set1")+
  scale_color_brewer(palette = "Set1")+
  labs(title = "Dylan lyrics: melancholy mood forever haunts me",
       subtitle = "Sentiment analysis based on ~500 song lyrics.\nMost frequent words by sentiment level at the top",
    x="Sentiment lexicon: AFINN",
    y= "Word frequency")+
  theme_void()+
  theme(legend.position = "none")

We can see that Dylan’s words trend negative, but more on the gloomy side of negative than downright terrible. The notable exception is level 2 (somewhat positive), where the most common word is “like”, appearing 613 times. But this looks like a miscategorization, particularly for an artist who uses similes as much as Dylan does. A quick scan of the songs including the word “like” reveals that it is almost always used as a preposition rather than a verb (e.g. “Like a Rolling Stone”, “Just Like a Woman”, “there’s no success like failure”), and as such carries no positive or negative connotation. So let’s try again, removing the word “like” along with the stopwords. While we’re at it, we can tweak some aesthetics of the labels.
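
For the curious, here’s one throwaway way to do that quick scan: sample some lyric lines containing “like” and read them in context. It reuses the dylan_lyrics list from above, and nothing downstream depends on it.

dylan_lyrics |>
  # Stack all songs into one data frame
  purrr::map_df(bind_rows) |>
  # Split each song's lyrics into individual lines
  tidyr::separate_rows(value, sep = "\n") |>
  # Keep lines containing "like" and eyeball a random sample
  dplyr::filter(str_detect(value, regex("\\blike\\b", ignore_case = TRUE))) |>
  dplyr::slice_sample(n = 20)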

Check out my code
library(tidyverse)
library(tidytext)
library(textdata)

lyrics_clean <- dylan_lyrics |>
  # Identify each list with the url using imap
  purrr::imap(~mutate(.x, url=.y)) |>
  # Bind all lyrics into one data frame
  purrr::map_df(bind_rows) |>
  # Get rid of pages that just invite you to click through to a list of live performances instead of lyrics
  dplyr::filter(!str_detect(value,"Click")) |>
  # Unnest words
  tidytext::unnest_tokens(input = "value", output = "text", token = "words") |>
  # Remove stopwords (and "like")
  dplyr::filter(!text == "like") |> 
  dplyr::anti_join(stopwords::stopwords("en") |> as_tibble(), by=c("text"="value")) |>
  # Join sentiment lexicon
  dplyr::inner_join(tidytext::get_sentiments("afinn"), by=c("text"="word")) |>
  # Calculate most frequent word by sentiment lexicon
  dplyr::group_by(text) |>
  dplyr::mutate(word_freq = n()) |>
  dplyr::ungroup() |>
  # Count the total number of scored words so we can compute each level's share
  dplyr::mutate(total = n()) |>
  # Add per-sentiment-level counts ####
  dplyr::group_by(value) |>
  dplyr::mutate(n = n(),
                p = n/total,
                c = case_when(value<0 ~ "Negative",
                              value==0 ~ "Neutral",
                              value>0 ~ "Positive") |>
                  factor())


# Visualise with ggplot
lyrics_clean |> 
  ggplot(aes(x=value,y=n,fill=c, color=c))+
  geom_col()+
  # Filter data within geom to create labels with top word in each category ####
  geom_text(data = . %>%
              dplyr::group_by(value) %>%
              dplyr::slice_max(n=1, order_by = word_freq) %>%
              dplyr::distinct(text, .keep_all = TRUE),
             aes(label = text),
            vjust = c(-.8,-.8,-7,-35.7,-12.3,-24.6,-16.5,-10.2,-.8,-.8), colour = "gray19")+
  geom_vline(xintercept = 0)+
  scale_x_continuous(breaks=seq(from=-5,to=5,by=1))+
  scale_y_continuous(limits = c(0,4000000))+
  scale_fill_brewer(palette = "Set1")+
  scale_color_brewer(palette = "Set1")+
  labs(title = "Bob Dylan's lyrics: World Gone Wrong",
       subtitle = "Sentiment analysis based on ~500 songs, labeled with most frequent words by sentiment level",
    x="Sentiment lexicon: AFINN",
    y= "Word frequency")+
  theme_void()+
  theme(legend.position = "none")

The trend towards melancholy lyrics shows up even more clearly here, and the Bard’s favorite gloomy word, showing up a whopping 95 times in the Dylan discography, is “wrong”. To quote the title track of his 29th studio album: “I can’t be good, baby / Honey, because the world’s gone wrong.”
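
(If you want to verify that count, everything needed is already in lyrics_clean; the first column of the result below should come out to the 95 quoted above.)

# Double-check: occurrences of "wrong", and how many songs it appears in
lyrics_clean |>
  dplyr::ungroup() |>
  dplyr::filter(text == "wrong") |>
  dplyr::summarise(times = dplyr::n(),
                   songs = dplyr::n_distinct(url))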

However, while “wrong” is the most frequent word in the most frequent sentiment level, it’s not actually the word that shows up most often in his lyrics. The next graph shows the top 10 most frequent words in the Dylan dataset (again excluding stopwords, and “like”). It turns out that despite his general trend towards the lugubrious, Dylan’s number one word is “love”.

Check out my code
library(tidyverse)
library(tidytext)
library(textdata)

lyrics_clean |>
  ungroup() |> 
  dplyr::distinct(text, .keep_all = TRUE) |> 
  dplyr::slice_max(n=10, order_by = word_freq) |>
  ggplot(aes(x=reorder(text, word_freq),y=word_freq,fill=word_freq))+
  geom_col(show.legend = FALSE)+
    coord_flip()+
    theme_classic()+
    scale_fill_distiller(palette="YlGnBu", direction=1)+
    # Labels
    labs(x = NULL, y = "Frequency",
       title = "The Words of a Generation",
       subtitle = "Top 10 most frequent words in Bob Dylan's lyrics")

I’d like to scrape some words, do Eunoia guy?

The website Eunoia purports to contain over 700 “untranslatable” words from 80+ languages, but only shows 30 at a time. In this section we’ll use a for loop to scrape the full word list. At first I tried looping until there were 700 unique words (via “while (length(unique(eunoia_list$word)) < 700)”), but that turned out to be a never-ending loop. Instead, I used a trial-and-error method: running the loop an increasing number of times until the number of unique words stopped growing. I also ran a few spot checks by comparing the number of words for a given language on the website versus in my dataset.

Check out my code
# Load the required packages
library(tidyverse)
library(rvest)

# Define the URL of the website
eunoia_site <- "https://eunoia.world/"

# Create empty tibble
eunoia_list <- tibble(word = character(), definition = character(), language = character())

# Loop a crazy number of times until we max out the number of unique words
for (i in c(1:300)) {
  eunoia_list <- eunoia_list %>%
    dplyr::bind_rows(
      read_html(eunoia_site) %>%
        rvest::html_elements("td:nth-child(1), td:nth-child(2), td:nth-child(3)") %>%
        rvest::html_text() %>%
        matrix(ncol = 3, byrow = TRUE) %>%
        as_tibble() %>%
        setNames(c("word", "definition", "language"))
    ) %>%
    dplyr::distinct(word, .keep_all = TRUE)
}

# Check total words
dplyr::n_distinct(eunoia_list$word)
[1] 692

The highest number of words I was able to find was 692, even when I increased the number of batches to 1,000 (that one took a while to run). Not quite “700+”, but not too far off. In hindsight, that also explains why the original while loop never terminated: a condition waiting for 700 unique words can never be satisfied if the site only ever serves 692. Further testing revealed that we can reliably get all 692 words by running the loop 300 times, so that is the number we are sticking with.
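
If you’d rather not hard-code 300 iterations, the original while-loop idea can be rescued with a different stopping rule: instead of waiting for a word count the site will never reach, stop once a long run of batches adds nothing new, with a hard cap as a safety net. A sketch of that variant (the thresholds of 50 stale batches and 1,000 iterations are my guesses, not tested values):

stale <- 0  # consecutive batches that added no new words
iter <- 0
while (stale < 50 && iter < 1000) {
  before <- nrow(eunoia_list)
  eunoia_list <- eunoia_list %>%
    dplyr::bind_rows(
      read_html(eunoia_site) %>%
        rvest::html_elements("td:nth-child(1), td:nth-child(2), td:nth-child(3)") %>%
        rvest::html_text() %>%
        matrix(ncol = 3, byrow = TRUE) %>%
        as_tibble() %>%
        setNames(c("word", "definition", "language"))
    ) %>%
    dplyr::distinct(word, .keep_all = TRUE)
  # Reset the stale counter whenever a batch contributes something new
  stale <- if (nrow(eunoia_list) > before) 0 else stale + 1
  iter <- iter + 1
}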

To finish off this session, let’s summarize our word list by source language…which languages are the most untranslatable?

Check out my code
library(tidyverse)

eunoia_list |> 
  group_by(language) |> 
  dplyr::mutate(n = n()) |>
  ungroup() |> 
  dplyr::distinct(language, .keep_all = TRUE) |> 
  dplyr::slice_max(n=10,order_by = n) |> 
  ggplot(aes(x=language |> reorder(n),y=n, fill = n))+
    geom_col(show.legend = FALSE)+
    coord_flip()+
    geom_text(aes(label = n, hjust = -.2), colour = "gray19")+
    # Labels
    labs(x = NULL, y = "# of words",
       title = "Unübersetzbare Sprachen",
       subtitle = "Top 10 Most Represented Languages on eunoia.world")+
    scale_fill_distiller(palette = "YlGnBu", direction = 1)+
    theme_classic()

Looks like German (aka the Lego playset of languages) is the most represented in the dataset, with 79 delightful words such as Schnapsidee (“an idea so stupid the person must have thought of it while drunk”) and Krawattenmuffel (“someone who’d rather not wear a tie”).
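
To browse a few more of them yourself (this assumes the language column stores plain names like "German", which is what the plot above suggests):

# Pull a random handful of the German entries
eunoia_list |>
  dplyr::filter(language == "German") |>
  dplyr::slice_sample(n = 5)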

In the next post we’ll build a Shiny app to interact with the Eunoia database.