Tutorial: Intro to Web Scraping with R
In a recent post, we looked at how journalists may want to use web scraping to gather data, as well as some legal issues to consider when doing so. Now, we’ll take a look at how you can use R to scrape data from text in a standard form across many pages. We’ll start by going over some of the key functions that will help us, and then we’ll walk through an example.
- readLines: This function reads the text of a webpage into R as a character vector, one element per line.
- grep and grepl: These two functions are used for pattern matching. grep returns the indices of an object where a pattern is found. If, for example, a character vector of five lines contains the word “the” in the first and fourth lines, grep will return the vector 1, 4. grepl is very similar, but returns logical (TRUE/FALSE) values; in the same example, grepl would instead return TRUE, FALSE, FALSE, TRUE, FALSE.
- strsplit: This function splits text wherever a specified pattern is found. This is useful in web scraping for breaking text up into the variables you are interested in.
- sub: As the name suggests, this function is used to substitute text patterns. You input a string of text, the pattern you want to replace, and the replacement, and the function returns a modified string.
- substring: This function takes a string input and returns the substring between the specified start and end character positions. For example, the command substring("North Carolina", 3, 5) would return "rth".
- rbind: This is used to add rows to a data frame.
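To see how these functions behave, here is a short sketch using made-up play-by-play lines (the player names and line contents are invented for illustration; readLines would produce a similar character vector from a real URL):

```r
# Hypothetical play-by-play lines, standing in for the output of readLines(url)
lines <- c("REBOUND (OFF) by SMITH, John",
           "MISS JUMPER by DOE, Jane",
           "REBOUND (DEF) by SMITH, John")

grep("REBOUND", lines)             # indices of matching lines: 1 3
grepl("REBOUND", lines)            # logical vector: TRUE FALSE TRUE

strsplit(lines[1], " by ")         # splits into "REBOUND (OFF)" and "SMITH, John"
sub("MISS ", "", lines[2])         # "JUMPER by DOE, Jane"
substring("North Carolina", 3, 5)  # "rth"

# rbind adds a row to an existing data frame
df <- data.frame(player = character(), shot = character())
df <- rbind(df, data.frame(player = "SMITH, John", shot = "JUMPER"))
```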
For my story on UNC basketball’s offensive rebounding, I scraped data from play-by-plays (like this one) on the official Carolina athletics website. The R script I wrote to do so is available in our Google Drive or on our GitHub.
Using one game’s play-by-play as a model, I wrote a function to scrape all of the offensive rebounding data for that game. By writing a function, I could then use a simple loop to run the function for all games in a season.
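That function-plus-loop pattern might look something like the sketch below. The URL list and the per-game function name scrape_game() are hypothetical stand-ins, not the script's actual details:

```r
# Hypothetical list of play-by-play URLs for a season
game_urls <- c("https://example.com/pbp/game1.html",
               "https://example.com/pbp/game2.html")

# Empty data frame to accumulate one row per offensive rebound
season_rebounds <- data.frame(opponent = character(), shooter = character(),
                              rebounder = character(), shot_type = character())

# Run the scraping function once per game and stack the results
for (url in game_urls) {
  season_rebounds <- rbind(season_rebounds, scrape_game(url))
}
```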
The data that I wanted to gather was as follows: one row for each offensive rebound containing information about the opponent, shooter, rebounder, and type of shot. To do so, these were the steps I used:
- First, I created an empty data frame with the four columns that I wanted. When we get to the actual scraping later on, this makes it easier to add observations to the data.
- Next, I ran the readLines function on the URL.
- To get the home and away teams, I used a series of grep, strsplit, and sub functions.
- Using the home and away team variables, I used two if statements to keep only UNC’s data and exclude their opponents’ data. If UNC was the home team, I took the left side of each line and assigned the away team to a variable called “opponent.” If UNC was the away team, I did the opposite. Since the text lines were standardized with the home and away teams’ information aligned in columns, I could cut each line into its left or right side using the substring function.
- Using the grep function, I got a vector of which lines signified an offensive rebound. In the play-by-play, offensive rebounds were designated by “REBOUND (OFF).” However, because parentheses are special characters in regular expressions, I had to escape them with double backslashes, using “REBOUND \\(OFF\\)” as the search pattern input for the grep function. This has to do with regular expressions, which can be a bit confusing. Here’s a good resource if you want to learn more, but we won’t need any more regular expressions in this tutorial.
- Now that I had the line indices for all offensive rebounds, I used a for loop to gather and store the shooter, rebounder, and shot-type information for each one. Because each offensive-rebound line in the play-by-play is structured “REBOUND (OFF) by <player>,” I ran the strsplit function (splitting on ” by “) on each line at an index returned in the previous step; the second element of the output is the rebounder’s name. I did the same thing for the shooter, using the previous line of the play-by-play, and used strsplit an additional time to get the type of shot. I then used the rbind function to add this information to my initial data frame.
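The steps above could be sketched as a single per-game function along these lines. This is a simplified assumption of how such a function might look, not the author’s exact script: the line formats follow the tutorial, but the home/away opponent logic is omitted for brevity:

```r
# Minimal sketch: scrape one game's offensive rebounds into a data frame
scrape_game <- function(url) {
  rebounds <- data.frame(opponent = character(), shooter = character(),
                         rebounder = character(), shot_type = character())

  pbp <- readLines(url)

  # Parentheses are regex special characters, so escape them
  off_idx <- grep("REBOUND \\(OFF\\)", pbp)

  # Opponent determination from the home/away columns is omitted here
  opponent <- NA_character_

  for (i in off_idx) {
    # "REBOUND (OFF) by SMITH, John" -> second piece is the rebounder
    rebounder <- strsplit(pbp[i], " by ")[[1]][2]

    # The previous line describes the missed shot, e.g. "MISS JUMPER by DOE, Jane"
    shot_line <- strsplit(pbp[i - 1], " by ")[[1]]
    shooter   <- shot_line[2]
    shot_type <- sub("MISS ", "", shot_line[1])

    rebounds <- rbind(rebounds,
                      data.frame(opponent = opponent, shooter = shooter,
                                 rebounder = rebounder, shot_type = shot_type))
  }
  rebounds
}
```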
For other data, you would need to rewrite this function to match the patterns in that data and to get different columns of information, but the basic principles remain the same. By learning only a handful of functions, you can write an R script to scrape large amounts of data.