Saturday, September 10, 2016

Trump for President? Aggregating national polling data

A few weeks ago, just after the Democratic convention, Hillary’s poll numbers shot up. A week before that, Trump’s shot up by a similar degree, before falling. At the time of writing, Hillary’s numbers have eased a few percent. It all seems a bit jumpy. Do people really change their minds so quickly, only to reverse them a few days later? Or are we just over-interpreting noise? And which polls should we believe?
These are fun statistical questions, and so one morning on the subway to work I (Jim) put together a very simple little model to aggregate polls. The idea was extremely simple: by themselves, polls are noisy measures of some unobserved truth (the true distribution of political preferences). They are noisy partly because of sampling variation, partly because of biases in questioning or methodology, and perhaps even because respondents get swept up in the moment and give answers that mightn’t reveal their true preferences.
It makes sense to smooth out the noise. But how? I proposed a very simple state-space model of political preferences where, on day $t$, the preference share $\pi_{c,t}$ for candidate $c$ evolves according to a random walk with normal innovations. This gives us a very simple model of the unobserved state:

$$\pi_{c,t} = \pi_{c,t-1} + \epsilon_{c,t}, \qquad \epsilon_{c,t} \sim \mathcal{N}(0, \sigma^{2})$$
What we do observe are noisy measurements of the state: the polls, conducted by polling firm $p$ and expressed as a proportion $\hat{\pi}_{c,t,p}$:

$$\hat{\pi}_{c,t,p} = \pi_{c,t} + \eta_{c,t,p}, \qquad \eta_{c,t,p} \sim \mathcal{N}(0, \tau^{2})$$
A more sophisticated model would introduce a bias term in the mean of the measurement model to control for any systematic bias in the polls.
In putting together the model, my point was simply that the modeller can impose a small value of $\sigma$ (or a very tight prior on it) and significantly smooth out the temporal variation in the polls, while capturing most of the long-run variation. And Stan makes it easy enough to do on the subway. You can see the scripts here and a discussion on Gelman’s blog here (the comments on the blog are superb).
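To make the toy model concrete, here’s a quick R simulation of the data-generating process it assumes (an illustration with made-up parameter values, not the Stan script linked above): the true preference share wanders as a random walk, and the polls are noisy observations of it.
# Simulating the toy state-space model (illustrative parameter values only)
set.seed(42)
days  <- 100
sigma <- 0.002   # sd of the daily random-walk innovations; small => smooth path
tau   <- 0.02    # polling noise (sampling variation, house effects, etc.)

# Unobserved state: the true preference share follows a random walk
pi_true <- cumsum(c(0.45, rnorm(days - 1, 0, sigma)))

# Observed polls: noisy measurements of the state on a subset of days
poll_days <- sort(sample(days, 30))
polls_obs <- pi_true[poll_days] + rnorm(length(poll_days), 0, tau)

plot(pi_true, type = "l", ylim = range(c(pi_true, polls_obs)),
     xlab = "Day", ylab = "Preference share")
points(poll_days, polls_obs, col = "red")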

Is this a good model?

Of course, the model I put together is insane. It has a random walk state, which is a fancy way of saying that we believe the vote preference shares are unbounded. There is no polling-firm bias term, no controlling for bias in the sample, and no attempt at all to capture any predictability in the vote shares or their dynamics. We should do better.

Enter, Trangucci

A few of the comments on the blog made the point that since the number of days is pretty small, we might be able to give Gaussian Process priors to the time-varying component of preference shares, rather than the random walk prior. Rob Trangucci, a mate of mine on the Stan team, had some spare time and thought that this would be a good approach. Gaussian processes have the attractive property that they provide more certainty around the more commonly observed points, and more uncertainty away from them. They are also fantastic at capturing dynamics. They make these improvements at the cost of speed.
A Gaussian process on a time-series variable $z_{c,t}$ can be thought of as the entire time series being a single draw from a multivariate normal distribution (a Gaussian):

$$z_c = (z_{c,1}, \dots, z_{c,T})' \sim \mathcal{N}(\mu_c, K_c)$$

The enormous covariance matrix $K_c$ is of course unidentified, so we impose constraints on its elements with a covariance function $k(t, t'; \theta)$ that depends on a much smaller number of parameters $\theta$, setting $K_c[t, t'] = k(t, t'; \theta)$. We construct the time-series variable $z_{c,t}$ to be related to candidate $c$’s preference share, so that when $z_{c,t}$ is higher, all else equal, so too is candidate $c$’s preference share.
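To get a feel for what a single draw from that multivariate normal looks like, here’s a small R sketch that builds the covariance matrix from a covariance function and takes one draw. For illustration it uses the AR(1)-style covariance that appears in Rob’s Stan code below; the function and variable names are my own.
# Drawing a latent time series from a Gaussian process (illustration only)
# Covariance function: k(t, t') = s2 * phi^|t - t'|
set.seed(42)
T_days <- 120

ar1_cov <- function(T_days, phi, s2) {
  d <- abs(outer(1:T_days, 1:T_days, "-"))  # |t - t'| for every pair of days
  s2 * phi^d
}

K  <- ar1_cov(T_days, phi = 0.95, s2 = 0.1)
mu <- rep(1.2, T_days)          # constant mean, cf. hill_mu in the Stan model

# A draw from N(mu, K): mu + L * eta, with L the lower Cholesky factor of K
L <- t(chol(K))
z <- as.vector(mu + L %*% rnorm(T_days))

plot(z, type = "l", xlab = "Day", ylab = "Latent z")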
By itself, the Gaussian Process prior doesn’t impose boundedness on the preference share; we want a mapping between the latent variable $z_{c,t}$ and the share itself. To achieve this, Rob suggested modeling the survey responses themselves. Where $y_{t,p}$ is the vector of counts of survey responses on date $t$ by polling firm $p$ (so $y_{t,p}$ might look like $(500, 400, 100)$ responses for Hillary, Trump and Other, respectively), the model would look like

$$y_{t,p} \sim \text{Multinomial}(\theta_{t,p})$$

which we can implement using a softmax transformation

$$\theta_{t,p} = \text{softmax}(z_t + b_{t,p})$$

where the softmax function takes a vector of unbounded values $z_t + b_{t,p}$ (with $z_t$ the latent variable that relates to the preference share distribution on day $t$, and $b_{t,p}$ a bias term), and returns a simplex (a vector of values in (0, 1) that sum to 1). $\text{softmax}(z_t)$ is our preference share. By assumption, biases are zero in expectation.
Rob models the bias parameter $b_{t,p}$ as the sum of a bias from the polling firm and a bias from the poll itself, both multivariate normally distributed around zero.
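To fix ideas, here’s a small R sketch of the measurement model for a single poll; the names and numbers are illustrative, not taken from Rob’s code. The latent values plus the poll and house biases pass through the softmax, and the resulting simplex parameterises a multinomial over the observed counts:
# Measurement model for a single poll (illustrative names and values)
softmax <- function(x) exp(x) / sum(exp(x))  # unbounded vector -> simplex

set.seed(42)
z_t <- c(1.3, 1.1, 0)   # latent values for (Clinton, Trump, Other); third pinned to 0

# Poll-level and house-level biases, zero in expectation; no bias term for Other
poll_bias  <- c(rnorm(2, 0, 0.05), 0)
house_bias <- c(rnorm(2, 0, 0.05), 0)

# Probabilities governing this poll's responses, and the implied counts
theta_tp <- softmax(z_t + poll_bias + house_bias)
y_tp     <- rmultinom(1, size = 1000, prob = theta_tp)

theta_tp        # what the poll "sees"
softmax(z_t)    # the bias-free preference share, roughly (0.48, 0.39, 0.13)
y_tp            # counts for Clinton, Trump, Other summing to 1000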

Implementing the model

Rob’s model is implemented below. The Stan code for the model is:
// Saved as MRT/toy_hill_don.stan
data {
  // define data
  int<lower=1> T; // number of unique dates
  int<lower=1> N; // n polls
  int poll_numbers[N,3];
  int time[N];
  int poll_house[N];
  int n_house;
}
parameters {
  vector[T] don_day_eta;
  vector[T] hill_day_eta;
  matrix[2,n_house] house_eta;
  vector<lower=0>[2] house_sd;
  matrix[2,N] poll_eta;
  vector<lower=0>[2] poll_sd;
  real<lower=0,upper=1> hill_phi;
  real<lower=0,upper=1> don_phi;
  real<lower=0> hill_sd;
  real<lower=0> don_sd;
  real hill_mu;
  real don_mu;
  cholesky_factor_corr[2] L_poll;
  cholesky_factor_corr[2] L_house;
}
transformed parameters {
  vector[3] u_pars[T];
  vector[3] u_pars_poll[N];
  vector[T] hill_day_re;
  vector[T] don_day_re;
  matrix[2,N] poll_re_km1;
  matrix[2,n_house] house_re_km1;
  real hill_sq;
  real don_sq;

  
  hill_sq = hill_sd / (1.0 - pow(hill_phi,2.0));
  don_sq = don_sd / (1.0 - pow(don_phi,2.0));
  {
    matrix[T,T] hill_day_cov;
    matrix[T,T] don_day_cov;
    
    // fill in the covariance matrices. We are using the VCOV matrix from an AR(1) model
    for (k in 1:(T - 1)) {
      for (m in (k+1):T) {
        hill_day_cov[k, m] = pow(hill_phi,m-k) * hill_sq;
        hill_day_cov[m, k] = hill_day_cov[k, m];
        don_day_cov[k, m] = pow(don_phi,m-k) * don_sq;
        don_day_cov[m, k] = don_day_cov[k, m];
      }
      hill_day_cov[k, k] = hill_sq;
      don_day_cov[k, k] = don_sq;
    }
    
    // fill in bottom corner of covariance matrices
    hill_day_cov[T, T] = hill_sq;
    don_day_cov[T, T] = don_sq;
    
    // Clinton daily effects
    hill_day_re = hill_mu + cholesky_decompose(hill_day_cov) * hill_day_eta;
    
    // Trump daily effects
    don_day_re = don_mu + cholesky_decompose(don_day_cov) * don_day_eta;
  }
  
  // We create the u_pars vector with 0 for third candidate for identifiability
  for (n in 1:T) {
    u_pars[n][1] = hill_day_re[n];
    u_pars[n][2] = don_day_re[n];
    u_pars[n][3] = 0;
  }
  
  // The poll random effects and house random effects are multi_normal around 0
  poll_re_km1 = diag_pre_multiply(poll_sd, L_poll) * poll_eta;
  house_re_km1 = diag_pre_multiply(house_sd, L_house) * house_eta;

  // Each poll has some bias, coming from the poll bias and the house bias
  for (n in 1:N) {
    u_pars_poll[n][1] = poll_re_km1[1,n] + house_re_km1[1,poll_house[n]];
    u_pars_poll[n][2] = poll_re_km1[2,n] + house_re_km1[2,poll_house[n]];
    u_pars_poll[n][3] = 0;
  }
}
model {
  
  // priors
  hill_day_eta ~ normal(0, 1);
  don_day_eta ~ normal(0, 1);
  to_vector(poll_eta) ~ normal(0, 1);
  to_vector(house_eta) ~ normal(0, 1);
  poll_sd ~ normal(0, 1);
  house_sd ~ normal(0, 1);
  L_poll ~ lkj_corr_cholesky(2);
  L_house ~ lkj_corr_cholesky(2);
  hill_phi ~ beta(5, 5);
  hill_sd ~ normal(0, 1);
  don_sd ~ normal(0, 1);
  don_phi ~ beta(5, 5);
  hill_mu ~ normal(1.2, 0.3);
  don_mu ~ normal(1, 0.3);
  
  // The likelihood
  
  for (n in 1:N) 
    poll_numbers[n] ~ multinomial(softmax(u_pars[time[n]] + u_pars_poll[n]));
}
generated quantities {
  matrix[T, 3] preference_share;
  for(t in 1:T) {
    // Exclude any poll or house biases
    preference_share[t] = softmax(u_pars[t])';
  }
}
We pull the data and do some cleaning here:
library(rvest); library(dplyr); library(ggplot2); library(rstan); 
library(reshape2); library(stringr); library(lubridate); library(arm)

realclearpolitics_all <- 
  read_html("http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html#polls")

# Scrape the data
polls <- realclearpolitics_all %>% 
  html_node(xpath = '//*[@id="polling-data-full"]/table') %>% 
  html_table() %>% 
  filter(Poll != "RCP Average")

polls <- polls %>% 
  mutate(
    n = as.integer(stringr::str_extract(Sample, '[0-9]+'))
  )


# Function to convert string dates to actual dates
get_first_date <- function(x){
  last_year <- cumsum(x=="12/22 - 12/23")>0
  dates <- str_split(x, " - ")
  dates <- lapply(1:length(dates), function(x) as.Date(paste0(dates[[x]], 
                                                              ifelse(last_year[x], "/2015", "/2016")), 
                                                       format = "%m/%d/%Y"))
  first_date <- lapply(dates, function(x) x[1]) %>% unlist
  second_date <- lapply(dates, function(x) x[2])%>% unlist
  data_frame(first_date = as.Date(first_date, origin = "1970-01-01"), 
             second_date = as.Date(second_date, origin = "1970-01-01"))
}
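As a quick illustration (with made-up field-period strings), the function returns the start and end dates of each poll, assigning everything at and after the hard-coded "12/22 - 12/23" row to 2015, since RealClearPolitics lists polls newest first:
# Illustrative input: RCP-style field-period strings, newest first
get_first_date(c("9/1 - 9/4", "8/26 - 8/28", "12/22 - 12/23"))
# Returns a two-column data frame:
#   first_date  2016-09-01, 2016-08-26, 2015-12-22
#   second_date 2016-09-04, 2016-08-28, 2015-12-23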


polls2 <- polls %>% 
  mutate(start_date = get_first_date(Date)[[1]],
         end_date = get_first_date(Date)[[2]],
         N = as.numeric(gsub("[A-Z]*", "", Sample))) %>% 
  dplyr::select(Poll, end_date, `Clinton (D)`, `Trump (R)`, N) %>%
  arrange(end_date) %>%
  rename(Clinton = `Clinton (D)`,
         Trump = `Trump (R)`) %>%
  filter(!is.na(N) & end_date >= ymd('2016-05-01')) %>%
  mutate(time = as.numeric(end_date) - as.numeric(min(end_date)) + 1,
         n_hill = floor(Clinton/100*N),
         n_don = floor(Trump/100*N),
         n_uk = N - n_hill - n_don)

polls2$Poll[grepl('Rasmussen',polls2$Poll)] <- 'ras'
polls2$Poll[grepl('Quinn',polls2$Poll)] <- 'quin'
polls2$Poll[grepl('CNN',polls2$Poll)] <- 'cnn'
polls2$Poll[grepl('WSJ',polls2$Poll)] <- 'wsj'
polls2$Poll[grepl('FOX',polls2$Poll)] <- 'fox'
polls2$Poll[grepl('Bloomberg',polls2$Poll)] <- 'bbg'
polls2$Poll[grepl('LA Times',polls2$Poll)] <- 'lat'
polls2$Poll[grepl('Economist',polls2$Poll)] <- 'econ'
polls2$Poll[grepl('NBC',polls2$Poll)] <- 'cnbc'
polls2$Poll[grepl('Pew',polls2$Poll)] <- 'pew'
polls2$Poll[grepl('Reuters',polls2$Poll)] <- 'reut'
polls2$Poll[grepl('Gravis',polls2$Poll)] <- 'gravis'
polls2$Poll[grepl('PPP',polls2$Poll)] <- 'ppp'
polls2$Poll[grepl('GWU',polls2$Poll)] <- 'gwu'
polls2$Poll[grepl('Wash Post',polls2$Poll)] <- 'wp'
polls2$Poll[grepl('CBS',polls2$Poll)] <- 'cbs'
polls2$Poll[grepl('Monmouth',polls2$Poll)] <- 'monmouth'
polls2$Poll[grepl('Survey',polls2$Poll)] <- 'surveyusa'
polls2$Poll[grepl('IBD',polls2$Poll)] <- 'ibd'
polls2$Poll[grepl('USA Today',polls2$Poll)] <- 'usatoday'
polls2$Poll[grepl('Marist',polls2$Poll)] <- 'marist'

hill_dat_prep <- polls2 %>% 
  filter(end_date >= ymd('2016-05-01'))

stan_dat <- list(N = nrow(polls2),
                 T = max(polls2$time),
                 time = polls2$time,
                 poll_numbers = data.matrix(polls2[,c('n_hill','n_don','n_uk')]),
                 poll_house = as.integer(factor(hill_dat_prep$Poll)),
                 n_house = length(unique(hill_dat_prep$Poll)))
And run the model using the following (this takes around 12 minutes on my 4-core laptop):
options(mc.cores = parallel::detectCores())
polls_model <- stan(file = "MRT/toy_hill_don.stan", 
                    data = stan_dat,
                    iter = 400)
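Before plotting, it’s worth a quick check that the chains have mixed. Something along these lines, using rstan’s print and traceplot on the parameters declared in the Stan program above, is usually enough:
# Rhat close to 1 and decent n_eff suggest the chains have mixed
print(polls_model, pars = c("hill_mu", "don_mu", "hill_phi", "don_phi",
                            "hill_sd", "don_sd", "house_sd", "poll_sd"))

# Trace plots for a few of the scalar parameters
traceplot(polls_model, pars = c("hill_mu", "don_mu", "hill_phi", "don_phi"))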
Finally, we can plot 95 per cent credible intervals for the preference share like so:
dates <- data_frame(time_period = 1:max(stan_dat$time),
                    Date = seq(from = min(polls2$end_date), by = "day", length.out = max(stan_dat$time)))

as.data.frame(polls_model) %>%
  dplyr::select(contains("preference_share")) %>% 
  melt() %>% 
  group_by(variable) %>% 
  summarise(mean = mean(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>%
  mutate(candidate = readr::parse_number(str_extract(pattern = ",[0-9]{1}", string = variable)),
         time_period = readr::parse_number(str_extract(pattern = "[0-9]{1,3},", string = variable)),
         candidate = ifelse(candidate==1, "1. Clinton", ifelse(candidate == 2, "2. Trump", "3. Other"))) %>%
  filter(candidate != "3. Other") %>%
  left_join(dates, by = "time_period") %>% 
  ggplot(aes(x = Date, colour = as.factor(candidate), fill = as.factor(candidate))) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha= .3) +
  geom_line(aes(y = mean)) +
  scale_fill_manual(values = c("blue", "red"), "Candidate") +
  scale_colour_manual(values = c("blue", "red"), guide = F) +
  ggthemes::theme_economist() +
  ggtitle("Aggregated poll preference share estimates", subtitle = "Trangucci model")

What is wrong with this model?

This model is obviously a lot better than the toy model that I put together. It doesn’t imply unbounded preference shares, and it adjusts for biases in measurement. Yet it could still use some improvement.
  • The latent series for the candidates evolve independently, so under the softmax a drop in one candidate’s share is handed to the others in rough proportion to their pre-existing shares (the sketch after this list makes this concrete). That mightn’t be a good assumption.
  • It takes forever to run!
  • The covariance matrix Rob has used is that of an AR(1) process, so we’re paying the computational cost of a full Gaussian Process without yet getting the richer dynamics that make it worthwhile. The implementation above should be seen as a starting point for building a more elaborate model.
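To see the first point concretely, here’s a tiny R sketch showing that, under the softmax mapping, a fall in one candidate’s latent value hands the freed share to the other options in proportion to their existing shares (the numbers are made up):
# A fall in one latent value scales the other options' shares up by a common
# factor, i.e. their gains are proportional to their existing shares.
softmax <- function(x) exp(x) / sum(exp(x))

before <- softmax(c(1.3, 1.1, 0))   # (Clinton, Trump, Other)
after  <- softmax(c(1.1, 1.1, 0))   # Clinton's latent value drops

round(before, 2)                    # roughly 0.48 0.39 0.13
round(after, 2)                     # roughly 0.43 0.43 0.14
(after - before)[2:3] / before[2:3] # identical proportional gains, about 0.095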
These are all candidates for improvement.