Tutorial: Using R to Generate Written Summaries of Data
In this tutorial, we’re going to take election results data from the North Carolina General Assembly 2016 election and use R to generate a short summary of each district. If you have never used R, here is a basic introduction that covers installing R and RStudio, as well as some simple commands. You can download the data set and the R script file that I used from our Google Drive or our GitHub. The original data set came from the State Board of Elections website. I did some preliminary manipulations (i.e., restricting to only General Assembly elections) in SQL and Excel to create the “election16candidates.csv” file before importing it into R for additional work. If you want to skip the cleaning and manipulation of the data, you can download the “election16candidates_clean.csv” version and go straight to the sentence-writing part.
Cleaning and Manipulating the Data
After reading in the file, I performed a few manipulations to make the data easier to use and to calculate some values that we want in our summary output. We want our summary to include the winning candidate’s name and party, their margin of victory (when applicable), other candidates who ran in that district, and what it would’ve taken for the district to flip parties. Here are a few of the functions I used to do that:
- is.na() : This function tests whether a cell in a data frame is empty (has value NA) and returns TRUE or FALSE. I used this to replace NA values with zeros (in columns of numbers) or “None” (in columns of text). Putting this function inside square brackets ensures that the replacement value is only assigned to cells where is.na() returns TRUE.
- pmax() : This function is the parallel maximum function, meaning that it returns the maximum value in each row of the input columns. I used this to create a new column containing the number of votes received by the winning candidate.
- ifelse() : This function takes a logical test as its first input, an action in the case that it is true as its second input, and an action for the false case as its third input. Using this function, I created a column containing the winning party in each district.
- round() : Just as the name suggests, this function rounds numbers to a specified number of decimal places.
- fractions() : This function is in the MASS package, which means that you must install the package using the command install.packages("MASS") and then run the command library(MASS) to load the package before you can use it. (Note: packages only need to be installed once, but must be loaded with library() every time you start a new R session.) It takes a decimal value and returns it as a fraction. With a little additional manipulation from a post found on StackOverflow, I used this to store numerator and denominator values in separate columns.
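The base R functions in this list can be tried out on a small made-up table before touching the real file. The sketch below is only an illustration: the column names (dem_votes, rep_votes, third_name) and the vote counts are invented for the example and are not the columns of the election file.

```r
# Toy data; column names and numbers are hypothetical, not from the real file.
results <- data.frame(
  dem_votes  = c(1200, NA, 800),
  rep_votes  = c(900, 1500, NA),
  third_name = c(NA, "Pat Doe", NA),
  stringsAsFactors = FALSE
)

# is.na() inside square brackets: only the NA cells receive the new value.
results$dem_votes[is.na(results$dem_votes)] <- 0
results$rep_votes[is.na(results$rep_votes)] <- 0
results$third_name[is.na(results$third_name)] <- "None"

# pmax() compares the columns row by row and keeps the larger value.
results$winner_votes <- pmax(results$dem_votes, results$rep_votes)

# ifelse() picks a label for each row based on a logical test.
results$winner_party <- ifelse(results$dem_votes > results$rep_votes, "DEM", "REP")

# round() to two decimal places: the winner's share of the two-party vote.
results$winner_share <- round(
  results$winner_votes / (results$dem_votes + results$rep_votes), 2
)

print(results)
```

Because the NA values are replaced before pmax() is called, the comparison never has to deal with missing numbers.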
Now that we’ve got that out of the way, we’re ready for the fun part. We want our output to be a function of two inputs: the legislature (House or Senate) and the district number. The first line of this function sets its name and inputs. We then need a series of if/else statements to shape the function’s output the way we want it.
The first if/else statement inside the function body sets the index value based on the input district. Because the data contains 170 rows (the 120 House districts followed by the 50 Senate districts), the data for the first Senate district lies in the 121st row. To compensate for this, we just need to add 120 to the district number when the input district is a Senate district. Note the use of the toupper() function. This function converts all letters to uppercase so that “senate”, “Senate”, “seNaTe”, “SENATE”, and all other capitalization variations will be caught by the if statement. If we had used tolower() and compared against "senate", we would have achieved the same effect.
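As a rough sketch of that index logic, here is a small stand-alone helper. The name row_index and its argument names are made up for this example; in the actual script this step sits inside the summary function rather than in a helper of its own.

```r
# Hypothetical helper: map (chamber, district) to a row number, assuming
# rows 1-120 hold the House districts and rows 121-170 the Senate districts.
row_index <- function(chamber, district) {
  if (toupper(chamber) == "SENATE") {
    district + 120   # Senate rows start after the 120 House rows
  } else {
    district         # House rows line up with the district number
  }
}

row_index("seNaTe", 1)   # 121
row_index("House", 37)   # 37
```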
The remainder of the function body code consists of three more if/else blocks that create a summary matching the results of the district. This might have been the trickiest part because it required some careful thought (and trial and error) to get the right conditions for the possible scenarios. In each of these, we need to use the cat() function to concatenate text and data values. For data values, we type the name of the desired column and use the index [i] to include the value from only the row corresponding to the input district.
The second if/else block sets up three main cases. We want a different structure of our summaries for uncontested races, races with both a Democratic and Republican candidate (regardless of whether there is a third-party candidate or not), and races with either a Democratic or Republican candidate (but not both) and a third-party candidate. In an uncontested race, the winning candidate’s votes will equal the total of all candidates’ votes. We also must consider the reverse logic to ensure that this test will not catch any contested districts. If only one candidate receives any votes in a district, does this necessarily mean that no other candidates ran? While it is theoretically possible for a candidate to run and receive no votes, I think we are pretty safe in assuming that every candidate running receives at least one vote. To create the second case, we need an else if statement, rather than just an else statement. Because we set up the “Loser” column to only include the loser between the two major parties, we can identify districts containing candidates from both parties by checking whether the “Loser” column is not equal to “None.” This case will include districts with only a Democrat and a Republican, as well as districts with three candidates. The third and final case will cover all remaining districts, which will be districts that were contested between a candidate from one of the two major parties and a third-party candidate.
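The three cases might be laid out roughly as follows. This is a sketch on toy rows, not the script itself: only the “Loser” column name comes from the description above, while every other column name, the candidate names, and the vote counts are invented for illustration. It also assumes a major-party winner in the third case.

```r
# Toy rows; every column except "Loser" is a hypothetical name.
races <- data.frame(
  Winner      = c("A. Smith", "B. Jones", "C. Lee"),
  WinnerParty = c("REP", "DEM", "DEM"),
  WinnerVotes = c(20113, 30412, 25077),
  Loser       = c("None", "D. Brown", "None"),
  LoserParty  = c("None", "REP", "None"),
  LoserVotes  = c(0, 19876, 0),
  TotalVotes  = c(20113, 50288, 30102),
  stringsAsFactors = FALSE
)

describe_race <- function(i) {
  if (races$WinnerVotes[i] == races$TotalVotes[i]) {
    # Case 1: the winner holds every vote cast, so the race was uncontested.
    cat(races$WinnerParty[i], "candidate", races$Winner[i],
        "won in an uncontested election.\n")
  } else if (races$Loser[i] != "None") {
    # Case 2: both major parties fielded a candidate.
    margin <- races$WinnerVotes[i] - races$LoserVotes[i]
    cat(races$WinnerParty[i], "candidate", races$Winner[i],
        "won this district by", margin, "votes over",
        races$LoserParty[i], "candidate", paste0(races$Loser[i], ".\n"))
  } else {
    # Case 3: one major party against a third-party candidate.
    other_party <- if (races$WinnerParty[i] == "DEM") "REP" else "DEM"
    cat(races$WinnerParty[i], "candidate", races$Winner[i],
        "won this district with", races$WinnerVotes[i],
        "votes and did not face a", other_party, "candidate.\n")
  }
}

describe_race(1)
describe_race(2)
describe_race(3)
```

The order of the tests matters: the uncontested check runs first, so the else if branch only ever sees contested districts.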
The next if/else block adds a sentence giving the third-party candidate’s name, party, and number of votes received, if there is a third-party candidate in that district. I did not include this in the previous if/else block because we want it in the summary for both cases two and three (when applicable) of that block. The if statement ensures that this sentence is printed only when there is a third-party candidate who did not win the election. If a third-party candidate won, this sentence would be redundant after the previous sentence, which already gives the information about the winning candidate. Although there were no third-party winners in 2016, this allows the function to be applied to other years in the past or future.
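One way to express that condition is sketched below. The function name and the fields ThirdName, ThirdParty, ThirdVotes, and WinnerParty are hypothetical, and comparing the third-party candidate's party with the winner's party is just one possible way to test "did not win"; the original script may check this differently.

```r
# Hypothetical fields: ThirdName, ThirdParty, ThirdVotes, WinnerParty.
third_party_sentence <- function(row) {
  has_third  <- row$ThirdName != "None"
  # Assumption: a third-party candidate lost if their party is not the winner's.
  third_lost <- row$ThirdParty != row$WinnerParty
  if (has_third && third_lost) {
    cat(row$ThirdParty, "candidate", row$ThirdName,
        "also ran, receiving", row$ThirdVotes, "votes.\n")
  }
}

# Prints a sentence: a third-party candidate ran and lost.
third_party_sentence(list(ThirdName = "Olen Watson III", ThirdParty = "LIB",
                          ThirdVotes = 5125, WinnerParty = "DEM"))

# Prints nothing: no third-party candidate in this district.
third_party_sentence(list(ThirdName = "None", ThirdParty = "None",
                          ThirdVotes = 0, WinnerParty = "REP"))
```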
The fourth and final if/else block is probably the most interesting and is where all the fraction calculations we did previously finally come in. Because third-party candidates rarely win elections and because the two main parties are the ones vying for control of the General Assembly, I chose to include this sentence only when both the Democratic and Republican parties had candidates running. I derived the two numbers in this sentence as the last step of the data manipulation phase. I first divided the number of votes needed to win the election (equal to 1 + the number of votes received by the winning candidate) by the number of votes received by the losing candidate. I then rounded this number to two decimal places, converted it to a fraction, and extracted the numerator and denominator. While I could’ve used more decimal places and gotten more accurate numbers, the goal was to make the data more relatable on a personal level, and it is easier to conceptualize 25 or 100 people than 5,000.
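The fraction step can be sketched like this, using made-up vote counts. One assumption is mine rather than the text's: because the finished sentence talks about additional voters, the sketch subtracts 1 from the ratio before converting it to a fraction. The variable names are also invented for the example.

```r
library(MASS)  # provides fractions()

# Made-up vote counts, for illustration only.
winner_votes <- 6949
loser_votes  <- 2500

# Votes needed to win (winner + 1) divided by the loser's votes,
# rounded to two decimal places.
ratio <- round((winner_votes + 1) / loser_votes, 2)   # 2.78

# Assumption: subtract 1 so the fraction counts only the additional voters.
extra <- fractions(ratio - 1)

# as.character() on a fractions object gives text such as "89/50";
# splitting on "/" separates the numerator from the denominator.
parts <- as.numeric(strsplit(as.character(extra), "/")[[1]])
numerator   <- parts[1]
denominator <- parts[2]

cat("Every", denominator, "people who voted for the losing candidate",
    "needed to bring an additional", numerator, "voters.\n")
```

If the rounded value happens to be a whole number, the text form has no "/" and the split returns a single piece, so a real script would need to handle that case separately.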
Once we have finished writing our summarizing function, we need to run it so that R will store it as a callable function. Once this is done, we can run some commands, and R will give us the summaries:
DEM candidate Joel Ford won this district by 49091 votes over REP candidate Richard Rivette. In order for Richard Rivette to win, every 50 people who voted for them needed to bring an additional 139 voters.
DEM candidate Yvonne Lewis Holley won this district with 28582 votes and did not face a REP candidate. LIB candidate Olen Watson III also ran, receiving 5125 votes.
REP candidate Jimmy Dixon won in an uncontested election.