Wednesday, October 26, 2016

Polling America

So, Donald Trump is currently telling everyone that "the polls are rigged". Turns out, he's right.

(This is called a "bait and switch." He's not; just stay with me a minute.)

So, statistics as a field is neat, if complex, and polling is largely applied statistics. I do a lot of work in Lean Six Sigma, which is all about polling statistics applied to industry - using polls of customers or of your process or your results to determine where your problems are and where your work is inconsistent. And polling, in general, is a very simple process: you choose a representative sample, ask your questions, and the size of your sample determines your margin of error - how far your result is likely to be from what the whole population thinks. And the interesting thing with it is that the size of the population actually doesn't change that interval a lot - test it out here in this calculator. A sample of 1,000 people will tell you what 100,000,000 people are thinking within 3 points roughly 95% of the time.
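If you'd rather see the arithmetic than trust a calculator, here's a rough sketch of the standard margin-of-error formula for a proportion, with the finite population correction included. The function name and defaults are my own for illustration; it assumes a simple random sample, the worst case of a 50/50 split, and a z-score of 1.96 for roughly 95% confidence.

```python
import math

def margin_of_error(sample_size, population_size=None, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion, as a fraction.

    Assumes a simple random sample. p=0.5 is the worst case, and z=1.96
    corresponds to roughly 95% confidence.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    if population_size is not None:
        # Finite population correction: shrinks the error only when the
        # sample is a noticeable fraction of the whole population.
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error

# Same sample of 1,000 people, wildly different population sizes.
for population in (10_000, 1_000_000, 100_000_000):
    moe = margin_of_error(1000, population)
    print(f"n=1000, N={population:>11,}: +/- {100 * moe:.2f} points")
```

The population size barely moves the result once it is much bigger than the sample, which is why 1,000 people can stand in for a hundred million.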

And we've been polling long enough that we've figured out how to get a "representative sample" of the U.S. population. We've learned from the mistakes of the past - for example, in 1936, the Literary Digest conducted an infamous poll (one of the first of its kind) which found that FDR was going to lose his re-election bid by a huge margin. In fact, FDR won in one of the largest landslides ever. Eventually someone looked back at the polling data and realized that the Digest had gathered the names of its respondents from telephone books. If you only polled people in 1936 who owned private telephones, you were only going to be sampling the highest income levels, so of course you were going to have an overwhelming number of people telling you they hated FDR. Enough sociological work has been done since to help us understand the major factors that drive voting habits, so you can make sure your sample roughly matches the overall percentages of age, gender, race, income, location, etc.

But - and here's the huge issue, and why Donald Trump's statement isn't completely wrong - not every American voter actually votes. About 60% of eligible voters turn out for presidential elections. So this is where pollsters have to also turn into pundits - they have to make some judgement on which 60% of Americans are going to show up to the polls in a few weeks. And this judgement isn't foolproof. Gallup - one of the most respected and storied names in polling, which actually correctly predicted the 1936 election at the same time the Literary Digest was spending more and coming to awful results - in 2012 predicted that Romney would win by 2 - 3 points, and its prediction was so far off that it formally decided not to involve itself in presidential polling in 2016. In its own post-mortem, Gallup identified that it failed in 2012 because it oversampled white voters and non-coastal states. Several other polling firms fell into the same trap, deciding that the uptick in minority and youth turnout in 2008 was a one-time effect that Obama wouldn't be able to sustain after four years of a presidency with tepid support and vocal opposition.
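To show how much that turnout judgement matters, here's a toy sketch. Every number in it is invented for illustration (the age groups, their shares, their candidate preferences, and both turnout scenarios), and no real pollster's likely-voter screen is this simple. The responses are identical in both runs; only the guess about who shows up changes.

```python
# Hypothetical electorate: each group's share of eligible voters and the
# fraction of that group backing each party. All numbers are invented.
groups = {
    "young":  {"share": 0.30, "dem": 0.60, "rep": 0.35},
    "middle": {"share": 0.40, "dem": 0.50, "rep": 0.46},
    "older":  {"share": 0.30, "dem": 0.44, "rep": 0.53},
}

def likely_voter_margin(groups, turnout):
    """Dem-minus-Rep margin, in points, among the voters expected to turn out."""
    voters = dem = rep = 0.0
    for name, group in groups.items():
        # Weight each group by how many of its members we expect to vote.
        weight = group["share"] * turnout[name]
        voters += weight
        dem += weight * group["dem"]
        rep += weight * group["rep"]
    return 100 * (dem - rep) / voters

# Two hypothetical guesses about who shows up (overall turnout near 60%).
youth_shows_up = {"young": 0.55, "middle": 0.62, "older": 0.70}
youth_stays_home = {"young": 0.40, "middle": 0.62, "older": 0.72}

high = likely_voter_margin(groups, youth_shows_up)
low = likely_voter_margin(groups, youth_stays_home)
print(f"Youth shows up:   Dem +{high:.1f}")
print(f"Youth stays home: Dem +{low:.1f}")
```

Nobody changed their answer between the two runs, and the margin still moves by more than a point and a half.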

So when Trump says the polls are "rigged", he doesn't actually mean that polling firms are making up numbers. If you look at his actual statement, it's stupid, but it follows a logic: he says that polling firms are oversampling Democrats and undersampling Republicans in their decision as to who "likely voters" are. The argument that the "likely voter" screen polling firms use is wrong isn't invalid; hell, I don't agree with a lot of the "likely voter" screens I've seen (I especially think the CNN polling firm is heavily biased towards older and white voters), which is one reason I think Clinton will outperform her polls just as Obama did in 2008.

What makes Trump's argument stupid is that he's arguing that Republican voters will show up in equal numbers to Democratic voters, and there's no logic to that. Democrats have consistently outnumbered Republicans over the last ten years, with current numbers showing that roughly 32% of Americans describe themselves as Democrats and 23% as Republicans. Saying that you should assume as many GOP voters as Democratic voters is laughable, and of course such an assumption would show Trump leading all the polls, because you'd be magicking up about 30,000,000 new Trump voters from nowhere.

What's even more hilarious is that this isn't the first time this argument has been advanced, and everyone should know better by now. In 2012, blogger Dean Chambers became famous for "unskewing the polls", specifically by going into polling cross-tabs and adjusting the numbers so that an equal number of Republicans and Democrats were voting. Suddenly, Romney was leading all of the polls by 6 points, and cruising towards victory! Romney's own internal staff followed Chambers's models and made the same adjustments, which is why Romney spent the last week of the campaign in Pennsylvania (a state Obama won by 5 points), didn't bother to write a concession speech, and had to get a ride home from his election night party from his son, supposedly because he had expected the Secret Service to drive him as the president-elect.
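For the curious, this kind of adjustment is just reweighting a poll by party ID, and it's easy to sketch. To be clear, this is not Chambers's actual model, and the cross-tabs below are invented for illustration; the point is only to show how forcing equal numbers of Democrats and Republicans can flip a topline without a single respondent changing their answer.

```python
# Hypothetical cross-tabs: each party group's share of the sample, and
# candidate support within that group. All numbers are invented.
sample_mix = {"Dem": 0.38, "Rep": 0.32, "Ind": 0.30}
support = {
    "Dem": {"A": 0.92, "B": 0.06},
    "Rep": {"A": 0.06, "B": 0.92},
    "Ind": {"A": 0.45, "B": 0.47},
}

def topline(mix, support):
    """Overall candidate shares: in-group support weighted by group size."""
    totals = {}
    for group, weight in mix.items():
        for candidate, share in support[group].items():
            totals[candidate] = totals.get(candidate, 0.0) + weight * share
    return totals

def equalize_parties(mix):
    """Force Dem and Rep to equal shares, leaving independents unchanged."""
    partisan_half = (mix["Dem"] + mix["Rep"]) / 2
    return {"Dem": partisan_half, "Rep": partisan_half, "Ind": mix["Ind"]}

as_polled = topline(sample_mix, support)
unskewed = topline(equalize_parties(sample_mix), support)

for label, result in (("As polled", as_polled), ("'Unskewed'", unskewed)):
    margin = 100 * (result["A"] - result["B"])
    print(f"{label}: A {result['A']:.1%}, B {result['B']:.1%}, A margin {margin:+.1f}")
```

In this made-up poll, candidate A's lead of about four and a half points turns into a narrow deficit purely because of the assumption about the party mix.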

So we can talk polling, and we can talk poll skewing, and we can talk about how well polls really reflect what's going on. But remember that Obama outperformed his polls in 2012; and Donald Trump consistently underperformed his polls in the 2016 primaries; and all of the early voting results so far indicate that the tens of millions of angry white disaffected new voters Trump promised to bring to the polls don't seem to exist. More on that next time.