The debating, campaigning and speculating will soon be over. In just a few short weeks, our new President will be elected.
Somehow, as this long and arduous campaign ground on, the parallels between website A/B testing and the election became apparent to me.
This shouldn’t come as much of a surprise. After all, an election, particularly a Presidential election, has essentially become the ultimate marketing campaign. Of course, one of the objectives of this particular marketing campaign is to make it appear as if it is NOT a marketing campaign…anyone remember “New Coke”?
With a myriad of contenders to the throne weeded out over the past year, it has now come down to the ultimate A/B test. But unlike our website A/B testing, we are all bombarded with images and soundbites of both A and B on a continuous and unrelenting basis.
Perhaps where the comparison to website testing is most relevant is when we look at the impact of bias in both scenarios.
Just a Bunch of Bias
“Minimize bias” is fairly standard advice to pollsters and A/B testers alike. We think we understand what that means: Take a random sample, don’t sample friends and coworkers, don’t artificially lead those surveyed towards one option or another, etc. But there is actually much more to it than that.
In Science and Engineering, bias is something you can assign an exact number value to. For example, if a scale consistently weighs a product 0.5 pounds heavier than the actual weight, you would say that scale has a bias of +0.5 pounds.
In attribute testing, including surveys, polling and A/B testing, bias means something altogether different. Picture a large vat of chemicals being mixed together or a stew being cooked on the stovetop. If you want to draw a sample with which you can evaluate the entire mixture, you make sure your stew is well stirred, that you are sampling from an area in the center that is representative of the whole batch, that you take your sample before the stew has been sitting for long, etc.
Assuming you have done everything in your power to make your sample representative, there are still many other factors that can cause bias in sampling plans such as:
- Sampling outside of the target population
- Tainting of the sample itself by outside influences
- Tainting the testing of the sample by outside influences
- Under-coverage of one of the groups that make up the population
Both A/B testing and Presidential polling are susceptible to all of these sources of bias, and many more.
Polling: The Long and Short of It
The Gallup poll has long been considered the gold standard of polling, particularly when it comes to Presidential elections, and with good reason. With one notable exception, this poll has been extremely accurate in predicting the election results over the past century.
In 1948, however, the Gallup poll memorably predicted Thomas Dewey would defeat Harry Truman by a sizable margin, leading to the infamous photo of President Truman holding up the erroneous headline, “Dewey Defeats Truman” while boarding a train the day after the election.
Why was this poll so far off? At the time, there were two main causes assigned to the error. At least one of them could be a cautionary tale for our A/B testing:
1. The pollsters stopped collecting data too far ahead of election day
Prior to Truman’s presidency, Franklin D. Roosevelt had been in office so long that most voters were either solidly with him or solidly against him. Roosevelt’s opponents didn’t surge, since his supporters didn’t waver. This led to the false assumption that a poll taken one or two weeks prior to an election would hold up on Election Day. As we now realize, that is not always the case.
If your A/B test shows a clear winner of one option over the other on a given day, keep in mind that the results are just that: results “on a given day”. The same test might have the opposite outcome a month or even a week later. That may be one reason why large companies sometimes implement continuous testing methods, such as Bandit testing.
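For the curious, the core idea behind one common bandit scheme, epsilon-greedy, fits in a few lines: keep sending most traffic to the current best performer while continuously exploring the alternatives. This is a minimal sketch, not any particular vendor's implementation, and the traffic numbers below are hypothetical:

```python
import random

def epsilon_greedy(conversions, visitors, epsilon=0.1):
    """Pick a variation index: usually the current best performer,
    but explore a random variation with probability epsilon.

    conversions / visitors are per-variation running totals.
    """
    if random.random() < epsilon:
        return random.randrange(len(visitors))          # explore
    rates = [c / v if v else 0.0 for c, v in zip(conversions, visitors)]
    return rates.index(max(rates))                      # exploit

# Hypothetical running totals for variations A and B:
arm = epsilon_greedy(conversions=[30, 45], visitors=[1000, 1000])
```

Because the allocation keeps updating, a bandit adapts if the "winner" changes over time, which is exactly the "on a given day" problem a fixed A/B test can fall into.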
2. The method of sampling known as “quota sampling” was prone to error
The idea behind quota sampling was to simulate the overall demographic in your sample population by following percentage rules based on the actual population breakdown. For example, you would poll numbers of each race and sex in equal proportion to the percentages in the general population. The flaw in this logic came from the difficulty in simulating every aspect of the population. For example, if the male voters you polled were all in the South, was this sample representative of the North as well? There were simply too many categories and combinations to mimic through quotas.
Sixty-eight years after Dewey vs. Truman, bias in Presidential election polling is still alive and well. Let’s take a look at what some recent polls are telling us, break down the bias, noise, and errors that may be influencing them, and see what valuable lessons from these polls can be applied to website testing.
If you can’t convince them, confuse them.
– Harry S Truman
Let’s Head to the Polls
1. WikiLeaks: A Poll
US poll: Who will you vote to become President?
— WikiLeaks (@wikileaks) August 18, 2016
With the recent convention headlines of E-mails and files leaked from the Democratic National Committee, it would not be surprising to learn that Twitter followers of WikiLeaks may be more non-establishment than the general population. But there may be an even stronger influence than follower demographics for this particular internet poll and others like it.
According to the traditional Margin of Error (MOE) estimation formula used by pollsters, this survey boasts a tiny +/-0.32% MOE, thanks to the large sample size.
An online sample size calculator produced roughly the same number (confidence interval is another way of stating margin of error) at a 95% confidence level when the sample size and approximate number of U.S. voters were entered.
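The traditional MOE formula is easy to check yourself. A minimal sketch in Python; the ~94,000-response sample size is a hypothetical figure, back-calculated from the quoted +/-0.32%:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Classic margin of error for a sample proportion.

    n: sample size; p: observed proportion (0.5 is the worst case);
    z: z-score for the confidence level (1.96 ~ 95%).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Roughly 94,000 responses (hypothetical) yields the quoted MOE:
print(round(margin_of_error(94_000) * 100, 2))  # → 0.32 (percent)
```

The same formula reproduces the +/-3.2% quoted later for the 922 likely voters in the WSJ poll. Note that nothing in the formula knows *who* was sampled, which is exactly why a tiny MOE says nothing about bias.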
However, internet surveys for U.S. elections are inherently unreliable based on what I call Internet bias. Political leaning of followers aside, the internet is a global entity, not strictly a U.S. one. Going back to our stew analogy, this means we could be sampling from our vat, along with those of a dozen or so other restaurants all at once. We all know there is no better way to spoil the broth than too many cooks.
To state the obvious, any type of opinion survey, including an A/B test, must be performed within the bounds of the user population. For example, if your customers reside in the Southwestern states, testing outside of that region cannot predict behavior inside the region. If it ever does, that is coincidence rather than analysis. Just remember that bias trumps sample size every time (pun intended).
2. Sign of The (L.A.) Times
Polling data published in the LA Times comes from the USC Dornsife poll. As the name suggests, the polling is performed in conjunction with the USC School of Politics. Despite appearances, the graph isn’t simply a direct percentage of how the respondents replied. The numbers are actually weighted to adjust for race and gender percentages in the overall population, as well as how likely a respondent reports that they will actually vote.
Similar to the quota sampling of Dewey vs. Truman, weighting of sample groups brings with it its own form of bias. Since the margin of error for your survey is based on the sample size, weighting the data artificially inflates the sample size of one group or another without accounting for the impact on margin of error. For example, if only 5 women were surveyed in your poll, but you needed 50 to make up a proportionate sample, multiplying the responses of those 5 women by a factor of 10 would not necessarily be statistically valid.
The margin of error can be corrected mathematically to account for the weighting, but this calculation is much more complicated than the basic formula and is rarely performed.
There is simply no substitute for an adequate sample size. Although weighting may be a convenient way to adjust your sampling, and is actually a fairly common practice in polling and surveying, you may be adding bias to your study rather than reducing it.
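The size of that weighting penalty can actually be estimated. Kish's design-effect approximation, n_eff = (Σw)² / Σw², shows how unequal weights shrink the effective sample, which in turn widens the true margin of error. A quick sketch using the hypothetical 5-women example from above:

```python
def effective_sample_size(weights):
    """Kish's approximation: n_eff = (sum w)^2 / sum(w^2).

    Unequal weights shrink the effective sample size below the
    nominal respondent count.
    """
    return sum(weights) ** 2 / sum(w * w for w in weights)

# 5 women up-weighted 10x to stand in for 50, plus 95 men at
# weight 1 (hypothetical numbers echoing the example above):
weights = [10.0] * 5 + [1.0] * 95
print(round(effective_sample_size(weights)))  # → 35, not the nominal 100
```

In other words, the heavy weighting quietly costs nearly two-thirds of the sample, yet the published MOE is usually computed from the nominal 100.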
As your sample size increases as a percentage of the population, the percentage breakdown of respondents by demographic will naturally approach the actual levels. Since A/B testing typically has an easier path to higher sample size, simply by increasing the duration of the test, weighting is luckily something we shouldn’t need to consider. More on this topic later.
3. Same Day, Different Result
The Wall Street Journal is one of the most respected news organizations in the country. This poll was published at roughly the same time as the LA Times (USC Dornsife) poll, yet the two candidates are nearly flip-flopped in their percentages. With 922 likely voters surveyed, their calculation of +/-3.2% MOE is accurate. That means the MOE boundaries could potentially put the two candidates in a dead heat, but would not approach the percentages shown in the LA Times poll.
Why such a disparity?
One obvious difference between the two polls is the absence of weighting factors in the WSJ poll, but there are several others, including:
- USC Dornsife respondents are recruited by (paper) mail; the WSJ survey is performed by phone.
- USC Dornsife respondents are sometimes paid for participation; WSJ respondents are not.
- USC Dornsife respondents may participate in repeated or ongoing surveys, whereas the WSJ survey is a single phone call.
Both mailed and telephone surveys are susceptible to nonresponse bias, meaning the views of the sub-group that responds may be skewed compared to the views of those who do not respond. With the survey methods themselves so different, the effect of nonresponse bias on the polling numbers is a huge unknown.
As surveys (out of necessity) rely more on cell phone cold-calling, one can only imagine the very small percentage of those called who are actually willing to participate, and how this might skew the results. ABC News polling methods have shifted from 10% cell phone in 2008 to 33.5% cell phone in 2015. I can also think of many examples where a person’s cell phone number is from a state in which they have not resided for some time. Are the pollsters able to sort this out?
The best data is always “organic” data, but the frustration of nonresponse can sometimes lead to shortcuts. In an ideal A/B test, users find your website through the same channels they always have, only half now see a slightly different variation. If conversion rates are relatively low, it may be a long, slow road to a statistically significant sample size. But this may be the price to be paid for unbiased, meaningful results.
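How long that road is can be estimated up front with the standard two-proportion sample-size formula (normal approximation). The 2% baseline conversion rate and half-point lift below are hypothetical:

```python
import math

def ab_sample_size(p_base, lift, z_alpha=1.959964, z_beta=0.841621):
    """Per-variation sample size to detect an absolute lift in
    conversion rate with a two-sided z-test.

    Defaults correspond to 95% confidence and 80% power.
    """
    p2 = p_base + lift
    p_bar = (p_base + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_base * (1 - p_base)
                                      + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / lift ** 2)

# Detecting a 2% -> 2.5% conversion improvement (hypothetical rates):
print(ab_sample_size(0.02, 0.005))
```

At low conversion rates even a generous lift demands five-figure traffic per variation, which is why patience, not shortcuts, is the honest answer.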
4. What was the Question?
The quantity and variety of polls available online and in print is simply staggering. In fact, there are so many polls published that an occasional double-take is needed to realize what is actually being presented. “Who Do You Expect to Win?” may seem equivalent to voting intention, but is actually very different. Imagine asking a group of people whether our hypothetical stew might appeal to customers, rather than whether the tasters themselves would ever order it.
Similarly, polls that ask whether you have a “favorable opinion” of someone can also be misleading. It’s not uncommon to have a favorable opinion of someone, but vote for someone else.
Within all the noise of published polls and data, the issue can sometimes be awareness rather than bias. Whether we realize it or not, we are all directly influenced by the opinions and preferences of others.
“If the majority of Americans are for it, maybe I should consider it too?”
Analytics data can now provide us with more information about our website visitor behavior than we could ever fully tap into. Like pulling the lever for candidate A or candidate B, what we are really interested in is predicting conversions. Option B may get more clicks, or visitors may stay on the site longer with option A, but it is important not to confuse these metrics with the experimental success criteria.
When reviewing any form of data, we need to be aware of not only the sample size, margin of error and sampling method, but carefully review the relevance of the data presented to what we are studying.
5. The New Paradigm: YouGov
Much of the polling data posted on Twitter accounts or found on popular news websites comes from a common source: www.yougov.com. The polling methods utilized by YouGov include internet-based sampling and weighting of demographics. As I alluded to earlier, this combination could be a recipe for bias and error, yet the poll seems to be in close alignment with the best conventional polling methods, including the aforementioned ABC News/Washington Post poll.
How do they do it?
To minimize the bias inherent in internet-based polling, YouGov uses an “opt-in” internet panel. This means respondents sign up to be part of the process and are vetted to confirm that they are registered U.S. voters.
Although the weighting formula was not published, it was interesting to note that the published margin of error was also adjusted (made larger) to compensate for the impact of weighting on sample size. Based on the number of respondents (1300) I calculated the margin of error to be +/-2.7%. YouGov’s adjustment for weighting factors increases the MOE to +/-3.9%.
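Working backwards from YouGov's published numbers shows roughly how large that weighting penalty is. The design-effect arithmetic below is my own reconstruction, not YouGov's published method:

```python
import math

n = 1300
raw_moe = 1.96 * math.sqrt(0.25 / n)   # unadjusted, worst-case p = 0.5
adjusted_moe = 0.039                   # YouGov's published, weight-adjusted MOE

# Implied design effect and effective sample size:
deff = (adjusted_moe / raw_moe) ** 2
print(round(raw_moe * 100, 1))         # → 2.7 (percent), matching the text
print(round(deff, 2))                  # → 2.06
print(round(n / deff))                 # → 631 effective respondents
```

If this reconstruction is right, the weighting roughly halves the effective sample, and YouGov deserves credit for saying so in its MOE rather than hiding it.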
The internet, when used appropriately, can be our friend. The key is in knowing who is providing your sampling data, whether they are a representative part of the population and in turn whether this data can be relied upon to predict future behavior. YouGov’s opt-in model might be analogous to running an A/B test using your E-mail subscribers as a sample pool. In YouGov’s case, this has translated to predicting the general population. In the case of A/B testing, it would probably only be relevant to that specific demographic, i.e. subscribers.
Debating the Debate
Wow. More CNN instapolls show huge Clinton victory.
"Better understanding of issues"?
Trump: 27% pic.twitter.com/3PXNKqBaf9
— Dan Diamond (@ddiamond) September 27, 2016
Before I had even watched the actual debate, I saw the results splashed all over the internet. The polling indicated a lopsided victory. Later that evening, I watched the recorded debate and learned more about how the polling was performed. The poll was conducted by telephone interview on a sample of 521 “likely voters”, of which 41% identified as Democrat and 26% identified as Republican.
According to Gallup, the number of Americans who report a party affiliation of Democrat has been within a few percentage points of those reporting to be Republican over the past several years.
What’s wrong with this picture?
If you read the very fine print, the margin of error reported for this poll was +/-4.5%. Based on sample size alone, this would be a fairly accurate number, but this poll has several big issues when it comes to bias.
First, the telephone polling method itself is highly susceptible to nonresponse bias and perhaps other sources of bias, depending on where the phone list originated. As we discussed earlier, simply finding someone who will answer an unknown phone call has become challenging.
Second, although weighting is not always a statistically valid practice, this survey’s uneven breakdown of respondents meant it needed to be weighted to have validity, yet it was not (nor was the MOE adjusted). Out of curiosity, I did my own basic weighting of the numbers to correct for party affiliation bias to see where it might end up.
An even split of independent voters would put the weighted results somewhere around 62% Democrat to 38% Republican. With a margin of error factored in, this could be at least as tight as 57% to 43%, if not closer. In other words, not quite the lopsided margin it appeared to be.
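The basic correction is post-stratification: each group's responses get weighted by its target share divided by its sample share. A sketch using the poll's reported party mix; the even-party target shares below are illustrative, not from any published source:

```python
def poststratification_weights(sample_shares, target_shares):
    """Weight each group so the sample's mix matches the target mix."""
    return {g: target_shares[g] / sample_shares[g] for g in sample_shares}

# The poll's reported mix vs. an even-party target, with
# independents taking the remainder (illustrative targets):
sample = {"dem": 0.41, "rep": 0.26, "ind": 0.33}
target = {"dem": 0.32, "rep": 0.32, "ind": 0.36}
weights = poststratification_weights(sample, target)
# Democratic respondents are down-weighted (~0.78x) and
# Republican respondents up-weighted (~1.23x)
```

Any such adjustment should also widen the published MOE, as discussed above; this poll did neither.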
Everyone wants answers quickly and decisively. This example shows how a poll with several sources of bias quickly became public record. In turn, that bias can influence the opinions and leanings of other voters, and could ultimately sway the election. The fast answer isn’t always the right one. Take your time in designing and executing an A/B test. It will be well worth it.
Having learned from the experiences of many successful businesses, the major candidates in this year’s Presidential election have utilized analytics data and A/B testing to optimize their websites and other elements of their campaigns. Although they are competing for votes rather than sales, they rely on this feedback from the public to decide what is working and what is not.
Perhaps this is why I found it ironic that the polling itself came to resemble an ongoing A/B test. If the polls are laden with bias, we may be surprised on Election Day. If our A/B testing suffers from bias, our reality will not match expectations once we select and implement the winning option. Either way, we will have been influenced and potentially disappointed by false assumptions.
Sample size and margin of error are extremely important. In fact, they are often the only metrics provided when a poll is published. As important as sample size is to accurate prediction, it can become almost meaningless under the weight of too much bias.
When it comes time to vote, bias in the polls may sway our preference or even keep us on the sidelines. On a smaller scale, a biased A/B test could be equally catastrophic in determining the success or failure of our website or business.