Written by R package build
on November 2, 2022

Final Prediction

Overview

Over the course of the semester, I have been building models to forecast the outcomes of the 2022 US House Elections. With each week devoted to a different theme that can be considered predictive of an election’s outcome, I have had the opportunity to work with a variety of datasets and consider the best context for each variable.
In this post, I detail my three models:
1. Nationwide Seat Share
2. Nationwide 2-party Popular Vote Share
3. Vote Share in Ohio’s 1st District House Race

My final predictions are that Republicans will gain control over the seat, winning 23 new seats and taking a lead of 236-199 seats. However, the 95% confidence interval is large and captures a small probability of the Democrats maintaining control, or at least losing by a smaller margin. Similarly, I predict that Republicans will reach the majority of the nationwide popular vote at 52.5%. But, this point estimate has a 95% confidence interval of (49.7,52), suggesting that my model is not 95% confident that they will win the majority voteshare. Finally, for Ohio’s 1st district, I implement three different regression methods. Averaged together, they forecast that Incumbent Steve Chabot (R) will beat challenger Greg Landsman (D), but the 95% confidence interval again captures the possibility of Chabot losing the seat.

Nationwide Models

Dataset and Setup

For my nationwide models, I join a number of datasets with information from the 1948-2020 elections. My primary dataset is the voting information for each year’s election, where the unit of analysis is the party*year, including variables like: the party’s vote share, seat share, the party of the president after the election, and the incumbent party of the house majority. I then built variables for whether or not the party matches the incumbent president’s party, whether or not it is a midterm year, the party’s seat share and vote share from the previous election.

Next, I merged this with each party’s average generic ballot polling in the 52 days prior to the election. In generic ballot polls, the surveyor asks the likely voter which party they are planning to support in the upcoming election. These are performed at a nationwide-level and provide a pulse on each party’s favorableness. I also included economic data into my main dataset, incorporating the gdp growth percentage in the quarter prior to the election. Finally, I included the sitting president’s mean approval ratings into my analysis to test whether house candidates are penalized or rewarded for the performance of the president.

I chose to perform this at a nationwide, rather than district, level because I was not confident that I could capture the effects of recent redistricting. While redistricting certainly presents limitations in a nationwide model, I am running this analysis under the assumption that the effects of redistricting “wash out,” such that both parties will gain/lose a few seats.

Model Description, Justification

To build my forecasting models, I used a linear model to predict each party’s seatshare and voteshare on all 72 observations (36 years,2 parties) and the following variables:
1. Interaction term for whether it is a Midterm Year interacted with a binary for whether the party represented that of the incumbent president (SamePartyPres). Coefficient:kappa(im)
2. Interaction term for the presidential approval rating interacted with SamePartyPres. Coefficient: gamma(i)
3. Interaction Term for the GDP growth percentage in the last quarter interacted with SamePartyPres. Coefficient: rho(i).
I chose GDP growth percentage as opposed to RDI or inflation because GDP is highly linked to Inflation and can stabilize better across shocks.
4. The party’s generic ballot percentage. Coefficient: beta(1)
5. The party’s seatshare or voteshare from the prior election. Coefficient: beta(2) Omissions: Advertising data because the most important advertisements are run leading up to the elections, so it would be challenging to incorporate into my model. Similarly, I did not include demographic, voting administration, turnout data because I wasn’t sure if it would be particularly helpful at the nationwide level

Here are the exact formulas written out:

knitr::include_graphics("images/seatshare.png")

(#fig:seat_reg)Seat Share Equation

knitr::include_graphics("images/voteshare.png")

(#fig:vote_reg)Vote Share Equation

I split my dataset into two parts: training (70%) and testing (30%) so I could perform out of sample validation on my models.

Results and Interpretation

stargazer(lm_seats,lm_vote,type='text')

## 
## ==========================================================
##                                   Dependent variable:     
##                               ----------------------------
##                                   seats      majorvote_pct
##                                    (1)            (2)     
## ----------------------------------------------------------
## Inc_party_pres                  -82.476***     -8.512***  
##                                  (16.156)       (2.145)   
##                                                           
## MidtermYear                      14.309**       2.123**   
##                                  (5.314)        (0.788)   
##                                                           
## approval                        -0.757***      -0.084***  
##                                  (0.186)        (0.026)   
##                                                           
## GDP_growth_pct                   -1.220**       -0.111    
##                                  (0.514)        (0.071)   
##                                                           
## poll_pct                         1.928***      0.292***   
##                                  (0.475)        (0.065)   
##                                                           
## lag_seats                        0.648***                 
##                                  (0.071)                  
##                                                           
## lag_pv                                         0.401***   
##                                                 (0.101)   
##                                                           
## Inc_party_pres:MidtermYear      -27.759***     -4.322***  
##                                  (7.830)        (1.221)   
##                                                           
## Inc_party_pres:approval          1.496***      0.166***   
##                                  (0.308)        (0.041)   
##                                                           
## Inc_party_pres:GDP_growth_pct    2.042***        0.170    
##                                  (0.723)        (0.102)   
##                                                           
## Constant                         32.954*       21.296***  
##                                  (17.902)       (3.753)   
##                                                           
## ----------------------------------------------------------
## Observations                        51            51      
## R2                                0.923          0.858    
## Adjusted R2                       0.906          0.827    
## Residual Std. Error (df = 41)     11.263         1.570    
## F Statistic (df = 9; 41)        54.763***      27.584***  
## ==========================================================
## Note:                          *p<0.1; **p<0.05; ***p<0.01

As these results show, each variable is significant in predicting a party’s seatshare and voteshare at the 5% significance level. There are high Adjusted R-sq values, or proportion of the variance explained by the models, sitting at .923 for the seatshare model and .8258 for the voteshare model.

To interpret the coefficients, it is helpful to group together all the variables that start with “Inc_party_pres,” or whether the row’s party is the same as the president’s party. While the beta(1) on the “Inc_party_pres” is extremely large, this represents the case where the president has a 0% approval rating because it assumes that this term, Inc_party_pres:approval, is 0. Hence, this term proves useful in building the regression but is challenging to interpret on its own. Instead, I will focus on the interaction variables themselves.

Gamma(1), or Inc_party_pres:MidtermYear, represents the association between the incumbent president’s party’s outcome and the election occurring during a midterm year.During midterm years, the incumbent president’s party is associated with a 28 seatshare loss and a 4.3 percentage point drop in the popular vote, holding all the other variables in this regression constant. On the other hand, Gamma(0) (MidtermYear) represents the association between the non-incumbent party and the Midterm Year and is positive. During midterm years, the non-incumbent party is associated with 14 more seats and a 2.2 percentage point increase in voteshare. This follows the historical patterns, which show that midterms are often “a referendum on the incumbent president’s party,” and that the reps may be penalized for being in the same party as the sitting presidnet.

That’s not to say that the incumbent president’s party will only drop during midterm years. As the regression shows, there are positive coefficients for Gamma(1) and Rho(1), which respectively represent the associations between the incumbent president’s party and the president’s approval rating and the country’s gdp growth. More precisely, every 1 percentage point increase in a president’s approval rating is associated with an increase of 1.5 seats and .166 percentage points in voteshare. Similarly, every 1 percent increase in gdp growth from the past quarter is associated with an increase of 2 seats and .17 percentage points in voteshare. On the other hand, the party opposing the incumbent president is negatively associated with the presidential approval rating and the country’s gdp.

Finally, it is interesting to examine the non-interaction variables: the generic ballot polling Beta(2) and the lag Beta(3), or prior, seatshare and voteshare by party. As expected, all the variables are positively and significantly associated with the party’s electoral outcomes. For every 1 percentage point increase in the party’s generic ballot polling, the party’s increases by 1.93 seats and .292 percentage points in voteshare. I use the lagged variables as another form of an intercept that can hold account for where the party was prior to the election. I believe that it encapsulates more than a simple binary for control of the house. Each seat a party currently holds is associated with .65 seats in the upcoming eletion and each voteshare percentage point from the prior election is associated with a 0.4 voteshare in the upcoming election.

Model Validation

Next, I test how well my model performed for Republicans and Democrats in sample (training dataset) out of sample (testing dataset). I use root mean squared error, or the standard deviation of the residuals (differences between predictions and observed results).

model	party	test_rmse	train_rmse	index_rmse
Seatshare	Republican	13.094321	10.008748	12.06580
Seatshare	Democrat	12.302858	10.191017	11.59891
Voteshare	Republican	1.154010	1.500629	1.26955
Voteshare	Democrat	1.407398	1.304746	1.37318

I created a simple index that weights the testing rmse twice as much as that of training and then divides the result by 3. I chose to weight the testing rmse higher because it represents the out-of-sample prediction and therefore evaluated how the model performs on data the model hasn’t seen. These results first demonstrate that for my seat share model, the model performs better for Democrats. On the other hand, the voteshare model performs slightly better for the Republicans.

Predictions and Uncertainty

One limitation of the linear model is that the sum of the Democratic and Republican seat/voteshare may not sum up nicely to 100 or 435. As a work around, I predict the point estimate for Democrats’ voteshare using their model’s predicted value and voteshare for Republicans using 435-PredictionSeatShareDemocrats. An important note is that this model doesn’t include the likely possibility of a seat being held by a member of one of the non-major parties. Similarly, I predict the voteshare for Republicans and then compute 100-PredictionVoteShareRepublicans to get the predicted Democratic voteshare.

For each prediction, I draw a 95% confidence interval to be 95% certain that this range will capture the true mean of the seat/voteshare. Similar to my point estimates, I capture the Republican seatshare/Democratic voteshare upper and lower bounds by subtracting by the other party’s lower and upper bounds, respectively.

party	year	pred_seats_fit	pred_seats_lwr	pred_seats_upr	pred_vote_fit	pred_vote_lwr	pred_vote_upr
D	2022	198.8392	174.3883	223.2900	47.46914	44.06289	50.87539
R	2022	236.1608	211.7100	260.6117	52.53086	49.12461	55.93711

My model predicts that Democrats will lose their house majority, lose 21 seats, and land at 199 seats. For Republicans, the model forecasts that they will regain the house majority, gain about 24 seats and land at 236 seats. Similarly, the model forecasts that Republicans will gain the majority of the voteshare with 52.5% of the vote and Democrats will capture 47.5% of the vote.

However, these point estimates are only a starting point. The 95% confidence interval for each party’s seatshare includes a range of +/-20 from the estimate, signaling that the model is only 95% confident about the true value landing in a wide range.

As the graphs above show, the confidence intervals overlap for both voteshare and seatshare, so they see a very small probability in which Democrats win control of the house or receive over 50% of the voteshare.

Discussion:

Through these models - which are primarily based upon incumbent, economic, and polling data - I anticipate that Republicans will regain control of the House and capture the majority of the voteshare tomorrow. However, the confidence intervals are proportionally very large, so I am not as confident in the exact margins.

Ohio 1st District

Next, I zoom into my specific district: Ohio-01. This district is viewed as a toss-up according to Cook Political Report, with 538 predicting it’ll go Republican. The sitting incumbent, Steve Chabot, is a Republican who has served 15 terms and only lost once since 1995. Democrat and City Council member Greg Landsman is the challenger, and has recently been polling at about even with Chabot. My dataset for this portion includes all election data going back to 1948 for Ohio-01.

In 2020, Cook rated Ohio-01 as a “Tossup R” district and Chabot beat the captured 53.7% of the vote. However, the recent redistricting cycle has likely brought in more democrats to the district, making it a more competitive district. Cook still believes it’s a Tossup R again, signifying that they anticipate it will be a close race but that Chabot will win again. FiveThirtyEight similarly predicts that Chabot will win and capture 52.6% of the popular vote. However, their overlapping confidence intervals show that they see a path in which Landsman wins.By simulation, they anticipate that Chabot will win 82% of the time and Landsman the other 18%.

I run three separate models:

The first uses fundamental data only:

(#fig:fund_oh)Equation for Fundamentals Model

The next uses some fundamental data and Cook Partisan Voter Index. Each election cycle, Cook creates a prediction for each district’s partisan leaning, which “measures how each district performs at the presidential level compared to the nation as a whole.” The scale ranges from safe, solid, lean, toss-up in either party’s favor. In this coding, less than 0 means in favor for Republicans and greater than 0 means in favor for Democrats, with -4 being Safe R and 4 being Safe D.

(#fig:cook_oh)Equation for Model with Cook Ratings

The final regression uses just polling data from the last three cycles (2018-2022). In this model, I weight by the polling grade from FiveThirtyEight and the inverse amount of days until election (1/days). I chose not to use a sample size weighting because all the polls had a sample of 500.

(#fig:poll_oh)Equation for Polling Model

Fundamental and Cook Models

For the fundamental dataset, I performed a 70% training-30% testing data split to test in-sample and out-of sample errors. The in-sample rmse is 5.6 percentage points while the out-of sample rmse is 9.8.

With my model incorporating Cook ratings, I only had their ratings going back to 2008, so my dataset was very small. In order to avoid overfitting, I included all the data for my Cook-based model. The in-sample rmse was 1.74 percentage points.

Here is the regression output:

## 
## ===================================================================
##                                    Dependent variable:             
##                        --------------------------------------------
##                                            pv2p                    
##                                 (1)                    (2)         
## -------------------------------------------------------------------
## StatusIncumbent              12.269***                             
##                               (2.992)                              
##                                                                    
## prev                           0.202*                              
##                               (0.106)                              
##                                                                    
## same_party:MidtermYear         -2.306                -3.645        
##                               (2.958)                (2.311)       
##                                                                    
## code:partyD                                         2.376***       
##                                                      (0.686)       
##                                                                    
## code:partyR                                         -2.590***      
##                                                      (0.678)       
##                                                                    
## Constant                     34.759***              50.322***      
##                               (4.959)                (1.859)       
##                                                                    
## -------------------------------------------------------------------
## Observations                     50                    12          
## R2                             0.436                  0.913        
## Adjusted R2                    0.400                  0.881        
## Residual Std. Error       8.965 (df = 46)        2.932 (df = 8)    
## F Statistic            11.868*** (df = 3; 46) 28.072*** (df = 3; 8)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

As the first regression shows, incumbents perform significanty better in Ohio-01, and for every 1 percentage point increase in the party’s previous voteshare, they gain about .2 percentage points. The second regression reports a very high Adjusted R-sq at .88, but there are only 12 observations so I’m a bit cautious. As expected, Democrats are significantly better off when the Cook rating is above 0 and vise-versa for Republicans.

Polling model

Similar to my Cook-based model, I only had polling info going back to 2008, so my dataset was very small. In order to avoid overfitting, I included all the data points for my polling-based model.

The regression below demonstrates that in this small sample, there is a very strong, positive association between a party’s polling percentage and their voteshare. The in-sample rmse is 2.29 percentage points.

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                pv2p            
## -----------------------------------------------
## pct                          0.794***          
##                               (0.181)          
##                                                
## Constant                      13.982*          
##                               (8.216)          
##                                                
## -----------------------------------------------
## Observations                    32             
## R2                             0.391           
## Adjusted R2                    0.371           
## Residual Std. Error       3.092 (df = 30)      
## F Statistic           19.266*** (df = 1; 30)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

## # A tibble: 3 × 6
##   end_date   party   pct day_diff fte_grade pred_poll[,"fit"] [,"lwr"] [,"upr"]
##   <date>     <chr> <dbl> <drtn>   <chr>                 <dbl>    <dbl>    <dbl>
## 1 2022-10-16 R        46  23 days B/C                    50.5     44.1     56.9
## 2 2022-09-21 R        46  48 days B/C                    50.5     44.1     56.9
## 3 2022-05-15 R        47 177 days B/C                    51.3     44.9     57.7

In my 2022 predictions shown above, Chabot’s expected voteshare has gone down by about .8 percentage points as the polls tighten. I am not extremely confident in this model because the dataset is so limited. So, I will average the predictions for Republicans across all three models with the latter two being weighted twice as heavily because they can better account for redistricting.

In my fundamental model, I predict that Chabot will get 57.9% of the vote while I predict 52.9% and 51.3% in my Cook and poll-based models, respectively. Averaging across these three with weightings, I get that Chabot will get 53.2% of the popular vote while Landsman will get 46.6% of the popular vote. Similar to earlier results, the 95% confidence intervals are extremely large, so I am not very confident in these results and can imagine a pathway where Landsman captures the seat.

References:

https://projects.fivethirtyeight.com/2022-election-forecast/house/ohio/1/

James Campbell. Forecasting the 2018 US Midterm Elections. PS: Political Science & Politics, 51(S1), 2018. URL.

Olle Folke and James M. Snyder. Gubernatorial Midterm Slumps:. American Journal of Political Science, 56(4):931–948, October 2012. ISSN 00925853. URL.

← → Top