Legislative Election Forecast Methodology
This project began with the idea of adapting an existing election prediction model (e.g. FiveThirtyEight, New York Times, etc.) to the North Carolina state legislative elections. However, these models are based almost entirely on poll data, and polls for state legislative races are rare, if they exist at all. This constraint meant any prediction model had to be based on other existing data, and past election results and voter registration statistics seemed to be the best candidates.
Collecting and Cleaning the Data
I began creating the model by collecting past election results for all N.C. House of Representatives and N.C. Senate districts. Because state legislative district boundaries were redrawn in 2011, only the 2012 and 2014 elections used the same districts as the upcoming election. This data was easy to get from the North Carolina State Board of Elections results website.
Although past results are related to future results, outcomes do vary from year to year, so I wanted to incorporate more data, specifically voter registration data. Because this data reflects changes in voter demographics between elections, I expect that a prediction using both results and registration will be more accurate than a results-only prediction.
The Board of Elections also has voter registration data available online. Unfortunately, the summary statistics available for each week since 2004 are aggregated by county rather than by legislative district. To get the statistics at the district level, I had to use the statewide voter registration file, which contains one row for each registered voter. I downloaded the file after it was updated on October 14, the deadline to register in order to vote on Election Day. Although same-day registration is available during early voting, I had to choose a cutoff point in order to analyze the data, and this seemed like the most logical one.
While this file contains the date each voter registered, I could not use that date to reconstruct the same statistics for 2012 and 2014, because doing so would exclude voters who were active at the time of those elections but have since become inactive or been removed from the file. Fortunately, I was able to find snapshots of this file from previous years. I downloaded the snapshots from January 1, 2013 and January 1, 2015 to find the statistics for the 2012 and 2014 elections.
Getting this data into a usable form took a little work. I used R to filter the 2016 file to include only active voters and then to count the number of registered voters by party (Democrat, Libertarian, Republican, unaffiliated) in each House and Senate district. I repeated the same process on the other two files, with one additional step to filter by registration date. From the 2013 file, I included only voters who registered before October 12, 2012, and from the 2015 file, only those who registered before October 10, 2014 (the registration deadlines in those years).
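The filtering and counting steps can be sketched roughly as follows. The original work was done in R; this is a Python illustration only. The column names, status codes, party codes, and sample rows are all made up for the example and do not reflect the real voter file layout, and treating the deadline date itself as included is also an assumption.

```python
import csv
import io
from collections import Counter
from datetime import date

# Hypothetical layout and made-up rows; the real voter file has its own columns.
SAMPLE = """district,party,status,registration_date
H1,DEM,ACTIVE,2008-09-30
H1,REP,ACTIVE,2012-10-05
H1,UNA,ACTIVE,2012-11-02
H1,DEM,INACTIVE,2004-03-15
H2,REP,ACTIVE,2010-06-01
H2,LIB,ACTIVE,2011-01-20
"""


def count_by_district_and_party(rows, cutoff=None):
    """Count active voters per (district, party).

    If `cutoff` is given, voters who registered after that date are skipped
    (assumption: the cutoff date itself is included).
    """
    counts = Counter()
    for row in rows:
        if row["status"] != "ACTIVE":
            continue  # keep active voters only
        registered = date.fromisoformat(row["registration_date"])
        if cutoff is not None and registered > cutoff:
            continue  # registered after the deadline for that election
        counts[(row["district"], row["party"])] += 1
    return counts


def party_shares(counts):
    """Convert raw counts into each party's share of its district's voters."""
    totals = Counter()
    for (district, _party), n in counts.items():
        totals[district] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts_2012 = count_by_district_and_party(rows, cutoff=date(2012, 10, 12))
shares_2012 = party_shares(counts_2012)
```

Calling `count_by_district_and_party(rows)` without a cutoff corresponds to the simpler case described for the 2016 file, where only the active-voter filter applies.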
Analyzing the Data
With the data assembled, I needed to create the prediction model. After some experimenting, I found a relatively strong linear pattern when plotting each party's 2014 vote percentage against the sum of its 2012 vote percentage and its 2014 registration percentage. When fitting the regression line, I excluded all districts that were uncontested in the 2014 election so that party results of 0% and 100% would not skew the analysis. I used R to calculate the linear regression, which had a p-value of 2.2*10^(-16). In other words, if there were no real linear relationship between the variables, the chance of seeing a pattern at least this strong would be about 0.000000000000022 percent.
Next, I applied the regression model to the 2014 party vote percentage and the 2016 party registration percentage to predict the 2016 party vote percentage. I classified each district into categories based on the predicted Republican vote percentage. There are 57 House and 15 Senate races in which candidates are running unopposed, so I made two separate categories for these races.
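To make the fit-then-predict procedure concrete, here is a minimal Python sketch (the original analysis was done in R). The district records are invented numbers, not real results. The category names and the 45/55 cutoffs are hypothetical, since the actual thresholds are not stated here, and the assumption that the two unopposed categories are split by party is also mine.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up Republican percentages for illustration; not real district data.
districts = [
    {"id": "A", "vote_2012": 55.0, "reg_2014": 40.0, "vote_2014": 57.0,
     "reg_2016": 41.0, "contested_2014": True},
    {"id": "B", "vote_2012": 45.0, "reg_2014": 30.0, "vote_2014": 43.0,
     "reg_2016": 29.0, "contested_2014": True},
    {"id": "C", "vote_2012": 62.0, "reg_2014": 45.0, "vote_2014": 64.0,
     "reg_2016": 46.0, "contested_2014": True},
    {"id": "D", "vote_2012": 38.0, "reg_2014": 28.0, "vote_2014": 36.0,
     "reg_2016": 27.0, "contested_2014": True},
    {"id": "E", "vote_2012": 100.0, "reg_2014": 50.0, "vote_2014": 100.0,
     "reg_2016": 50.0, "contested_2014": False},
]

# Fit on districts contested in 2014 only, so 0% and 100% results are left out.
contested = [d for d in districts if d["contested_2014"]]
xs = [d["vote_2012"] + d["reg_2014"] for d in contested]  # predictor: the sum
ys = [d["vote_2014"] for d in contested]                  # response
slope, intercept = fit_line(xs, ys)


def predict_2016(d):
    """Shift the inputs forward one cycle: 2014 vote plus 2016 registration."""
    return intercept + slope * (d["vote_2014"] + d["reg_2016"])


def categorize(predicted_rep_pct, unopposed_party=None):
    """Hypothetical categories; the 45/55 cutoffs are illustrative only."""
    if unopposed_party is not None:
        return f"Unopposed {unopposed_party}"
    if predicted_rep_pct >= 55.0:
        return "Likely Republican"
    if predicted_rep_pct <= 45.0:
        return "Likely Democratic"
    return "Toss-up"


# For simplicity, predictions are shown only for the contested districts.
predictions = {d["id"]: predict_2016(d) for d in contested}
labels = {k: categorize(v) for k, v in predictions.items()}
```

Note that the same fitted slope and intercept are reused for the 2016 prediction; only the inputs change. This assumes the relationship observed between 2012 and 2014 carries over to the next cycle.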