Replicating Voter Turnout Analysis at County Level
Last November’s divisive election raised many questions for reporters and citizens alike. In a recent release, Democracy North Carolina analyzed voter turnout in the state for the November 2016 election among different demographic groups. In this post, we’ll see how you can replicate this analysis for your county.
The first step in any data analysis is finding the correct data. In order to calculate voter turnout for different demographic groups, we will need two sets of data: one containing the number of registered voters in each group at the time of the election and a second containing the number of people in each group who actually voted.
The good news is that the Democracy North Carolina piece contains a link to the raw data at the North Carolina State Board of Elections website. The bad news is that the link goes to a page with several data sets, and it is not immediately obvious which ones we need. There are a bunch of election dates listed, each of which links to a page with several data sets, as well as some layout files at the bottom of the page. These layout files are our ticket. Each of these layout files contains the metadata for the corresponding files within the election folders and describes the purpose, field names, and format of the file. From this information, we can see that history_stats files contain voter history statistics (i.e. people who voted) and voter_stats files contain registration statistics. If we then go into the 2016_11_08 folder, we can download these two files containing all the data we need.
There are three basic steps to perform in order to get the voter turnout summaries that we want: filtering, grouping, and summing. Filtering is the only step not used in the Democracy North Carolina analysis.
These three steps can be performed with several different tools, and I’ll show you two: SQL and R. The SQL method is more straightforward, but it requires much more repetition and takes more time. The R method is slightly less intuitive but faster and less tedious. It took me more time up front to write the code in R, but is now much easier to repeat than the SQL method. My code is available here, so you can skip all the up-front time (if you trust me) and go straight to the output.
The SQL Method
To perform this analysis using SQL, you will need to run a series of commands of the following form:
SELECT <demographic_variable>, SUM(total_voters) FROM voterstats11xx08xx2016 WHERE county_desc = '<County_Name>' GROUP BY <demographic_variable>
This query will return the number of registered voters in each group within the selected demographic variable. To get the number of people in those groups who actually voted, you would just repeat that query with historystats11xx08xx2016 instead of voterstats11xx08xx2016.
You can do the same for multiple demographic variables at once. For example, the following query will return the number of people who voted in each combined subgroup of race and party (i.e. the number of male and female Republicans, male and female Unaffiliated, etc).
SELECT race_code, party_cd , SUM(total_voters) FROM historystats11xx08x2016 WHERE county_desc = 'County_Name' GROUP BY race_code, party_cd
The problem with this approach is that to get each different demographic group, you will have to run a lot of commands. Another problem is storing the output of these queries. While I am sure there is a better way, with my limited SQL knowledge I only know to copy and paste the output into an Excel document. As you can probably tell, this gets old pretty quickly, especially if you do it for one county and then decide you’d like to see it for another county. Due to these limitations, I prefer to do this analysis in R.
The R Method
In R, it is much easier to define functions and save results as variables. In my analysis, I chose to define two functions: one that returns the number of people registered or voted in each group in a particular demographic, and a second that runs the first function for all demographics and returns the turnout for each. You can download the R script file containing my functions here.
The first function performs nearly the same task as a single SQL statement and takes two inputs: a data set and whether the data is voter history or registration data. As in SQL, it groups, sums and selects to manipulate the data and return the grouped totals. It does not contain an equivalent to the SQL WHERE statement to filter the data, as I chose to perform this step in the second function. Because the second function uses this first function, the data input of the second function will be filtered when necessary. I used the tidyverse package to perform these tasks. A great explanation of these functions is available here.
The second function takes four inputs: voter history data, voter registration data, a list of counties, and a true/false value for whether the output should be filtered. If the last two inputs are not specified, the function will run for all counties and set filtered to true. If the output is filtered, the function will only return the 48 specific demographic subgroups contained in the Democracy North Carolina table. Otherwise, the function will return statistics for over 200 demographic subgroups.
This second function filters the voted and registered data to the selected county or counties. It then runs the first function on both data sets for each demographic variable and some combinations of demographic variables. It combines each output into a single data frame and calculates the voter turnout for each group.
Running the R Functions
If you downloaded my R script and want to run it, there are only a couple things you need to do. First, download the voter stats and history stats files from the NC State Board of Elections site. As long as the data format is the same, the function will also work for previous years.
Next, open the demographicTurnout.R file in R or RStudio and run it. In RStudio, you can do this with Cmd/Ctrl + Alt + R or by clicking Source. Running the document will store the function definitions so that you can use them.
Next, you need to import the voter registration and history files into R using read_tsv. These commands are included in lines 13-14 of my R script as comments, but you will need to update the file path once you download the files so R can find them. The final commands may look something like this:
voterstats <- read_tsv("/Users/username/Downloads/voterstats11xx08xx2016.txt") historystats <- read_tsv("/Users/username/Downloads/historystats11xx08xx2016.txt")
Now you need only one line of code to return the voter turnout for various demographic subgroups. If, for example, you want to see voter turnout breakdowns in Orange County, you can run the following command:
demographicTurnout(historystats, voterstats, counties = "Orange")
You can also see the turnout for multiple counties. It is also a good idea to store the output of this function as a variable so that you can later analyze or export it. That command would look like this:
two_county_turnout <- demographicTurnout(historystats, voterstats, counties = c("Guilford","Forsyth"))
You can then save this variable to your computer using the write.csv function.
write.csv(two_county_turnout, "/Users/username/Documents/Guilford and Forsyth Turnout.csv")