I should not be too surprised that my first recollection of pollution coincided with my first childhood trip to Los Angeles. As my family and I first drove through the valley, a smoky haze appeared, seemingly out of nowhere. It wasn’t so much what I was seeing that I remembered, but what I could now suddenly not see.
Pollution can prevent you from seeing the landscape, the water…or your A/B test results, with optimal clarity. It can also degrade the overall quality of your experience in many ways.
Just as there are many different types of environmental pollution (including some we don’t often think about) there are probably more sources of A/B test pollution than we care to realize as well.
Whether you are planning, executing, or analyzing your website testing, you need to consider all of these sources of test pollution to prevent your results from being left in the haze.
The Usual Suspects
Think of these first three as the '69 Camaro that can't pass a smog test, or the fast-food wrappers collecting on the side of the freeway – the obvious sources of pollution. They are still worth mentioning, since sometimes they can be so obvious that they are hiding in plain sight.
Biased (skewed) Data
In the world of statistics, bias is essentially the opposite of random. Biased sampling is probably the most common, and also dangerous, potential source of pollution.
In an A/B test, or any other type of statistical experiment or test, you are drawing a sample from the overall population. That means your sample needs to be as representative as possible. A good sample doesn’t systematically favor certain groups within the target population, and doesn’t exclude certain segments either.
Try to avoid what is known as “convenience sampling”. This is when a group who is easy to access, such as your friends or co-workers, is used to perform some or all of the A/B testing. In fact, these are the exact people to avoid as test subjects, even if they are represented equally on both legs of the test, since they will usually introduce bias of some kind into the testing.
In short, make sure both the A (control) and B (test) legs are randomly drawn from your overall population of users.
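To make that concrete, here is a minimal sketch of randomized assignment (the function name, salt, and 50/50 split are my own illustration, not from the article): hashing each visitor's ID with a per-test salt gives every visitor an effectively random, sticky assignment to one leg.

```python
import hashlib

def assign_variant(user_id: str, salt: str = "test-001") -> str:
    """Deterministically assign a visitor to the A or B leg.

    Hashing the user ID with a per-test salt spreads visitors
    randomly across both legs (no convenience sampling), and a
    returning visitor always lands on the same leg.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-random bucket 0-99
    return "A" if bucket < 50 else "B"
```

Because the assignment is a pure function of the ID, you get stickiness for free without storing per-visitor state.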
Inadequate Sample Size
You probably calculated your necessary sample size based on the confidence level and confidence interval you are looking for.
Again, these sample sizes apply to both legs of your test. If, for any reason, your actual sample size winds up smaller, STOP. You need to consider the pollution this has created.
You can still go ahead and analyze your data, but keep in mind that your Confidence Interval (your mean result ± the margin of error) will now be larger.
Are your results still precise enough for the difference you wanted to detect? If not, it may be time to rinse and repeat.
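As a rough illustration (the helper and the example numbers are my own, using the standard normal approximation for a proportion – a proper power-analysis tool will be more precise), here is how the per-leg sample size falls out of your chosen confidence level and margin of error:

```python
import math

# z-scores for common confidence levels
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def sample_size_per_leg(baseline_rate: float, margin_of_error: float,
                        confidence: float = 0.95) -> int:
    """Visitors needed on EACH leg to estimate a conversion rate
    within +/- margin_of_error at the given confidence level."""
    z = Z_SCORES[confidence]
    n = (z ** 2) * baseline_rate * (1 - baseline_rate) / margin_of_error ** 2
    return math.ceil(n)

# e.g. 5% baseline conversion, +/-1% precision, 95% confidence
print(sample_size_per_leg(0.05, 0.01))  # 1825
```

Notice the margin of error is squared in the denominator: halving it quadruples the required sample, which is why an undersized test widens your interval so quickly.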
Be sure to read our article When Are You Ready For A/B Testing?, as it covers the topics of confidence levels, confidence intervals and sample size in detail.
Test Duration
This one is pretty obvious too, but I am including it for added emphasis just the same. First, make sure the duration of your test, i.e. the period of time over which you collect your data, is long enough to include at least one, and preferably two or more, full business cycles. For most eCommerce businesses, this means a period of about two weeks or longer. Be sure to take a thorough look at your business to identify the actual length of your business cycle.
Going back to your sample size, make sure that the population (website visitors) you use in your calculations is the number of visitors for this same test time period.
If your test duration is too short, you will fail to accurately represent the natural fluctuations of a business cycle in your testing, even if your sample size was adequate.
If your test duration is too long, you introduce other sources of noise pollution into your testing, such as repeat traffic, non-standard time segments such as holidays, and other unrelated user behavior patterns clouding your results.
Now that you have gotten a handle on these first three pollution sources, you are well on the way to a cleaner and “greener” A/B test. But just like the odorless, colorless CO2 gas slowly creating a greenhouse effect in our atmosphere, there are a few other less obvious sources of A/B test pollution you need to keep in mind.
1. Confounding Pollution
Confounding is the A/B test equivalent of light pollution.
In the city, all the combined street, home and building lights can prevent us from seeing the stars at night. This is known as light pollution.
The “star” of our A/B test, of course, is the variable you are modifying. The light pollution comes from anything else (even something minor) you might have changed at the same time, therefore dimming our star.
For example, you might think it is harmless to piggyback a minor landing page copy change with the new order button you are A/B testing. But as soon as you change even one other feature, no matter how minor, you are confounding your results.
Confounding essentially means influencing the evaluation of one factor, intentionally or not, with the introduction of another factor. In any A/B test, you need to keep all elements you are not testing stable.
If you really want to test 2 or more factors (changes) at a time, there’s a great way to do it. One of my favorite statistical tools is known as Design of Experiments (DOE). In the business world, it is also known as multivariate testing.
Using a matrix like the one shown here, you can systematically test multiple changes at once, as well as the effects resulting from combinations of changes.
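For readers without the matrix in front of them, a sketch of what it contains: a full-factorial design for 3 factors at 2 levels is simply every combination of the levels, 2 × 2 × 2 = 8 runs (the factor names below are made up for illustration):

```python
from itertools import product

# three illustrative factors, each at two levels
factors = {
    "button_color": ["blue", "green"],
    "headline":     ["short", "long"],
    "image":        ["product", "lifestyle"],
}

# the full-factorial design matrix: every combination of levels
matrix = [dict(zip(factors, combo)) for combo in product(*factors.values())]
for run_number, run in enumerate(matrix, 1):
    print(run_number, run)
```

Each of the 8 rows is one version of the page shown to its own randomly drawn group, which is exactly why the total sample requirement grows so quickly.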
When you analyze your results, you will be able to find out which changes were significant in influencing your output (conversion rate), which combination of factors worked best, and even which factors cancelled out one another's effects.
The downside is that you will need a much bigger sample size than you needed for A/B testing, since each combination (8 combinations for 3 factors at 2 levels in the example) will need its own equally large sample to produce results with the same accuracy.
If you intend to test several factors at a time, multivariate testing (DOE) is an awesome tool. Just make sure you don’t pollute your A/B testing with anything more than one change at a time.
2. Funnel Pollution
When you see floating debris of all shapes and sizes and…smells in the lake (pick any lake), keep in mind that it probably didn’t start out there. In fact, it could have originated in any of dozens of rivers and streams that feed into that lake.
Your funneling channels are the rivers and streams of your A/B testing, and understanding their impact is important in preventing the pollution of your test.
You should always pay attention to the top, the middle and the bottom of the funnel.
Obviously, there are non-organic entry points to your sales funnel that, by nature, stand a much better chance of converting. If you stack your A/B testing deck so that most of your subjects are drawn from a more likely conversion channel, such as an E-mail list of previous customers, you are introducing pollution, even if this applies to both the A and B sides of your test.
If you are relying completely on organic (search engine, link, etc.) traffic to make up your test population, and you find that too many subjects bounce before they even get to the A/B test, you may want to refocus your testing upstream, and optimize your sales funnel before you run that test.
If you are deliberately short-circuiting the sales funnel, such as using the aforementioned E-mail with a direct link to the order page to simplify things, you have deviated from the natural behavior of your users and introduced more pollution into the testing.
Use your analytics to study the funnel behavior of your users. Then try to conduct your test so that both the A and B groups are funneling into the test variable just as randomly as the overall population would.
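As a quick illustration of that kind of funnel study (the stage names and visitor counts below are invented), the step-through rate between consecutive stages shows where visitors leak out before they ever reach your test variable:

```python
# illustrative funnel: (stage, visitors reaching that stage)
funnel = [("landing", 10_000), ("product page", 4_000),
          ("cart", 1_200), ("checkout", 300)]

# step-through rate between consecutive stages
rates = {stage: n_next / n
         for (stage, n), (_, n_next) in zip(funnel, funnel[1:])}
for stage, rate in rates.items():
    print(f"{stage} -> next stage: {rate:.0%}")
```

If the biggest leak sits upstream of your test element, that is the stage worth optimizing before you spend a test cycle downstream.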
3. GUI Pollution
I would consider this form of pollution to be closely related to confounding, but also different in a few subtle ways.
The GUI (graphical user interface) is the method your visitors use to access your website. This includes different device types, as well as different web browser types, which is obviously a huge number of possibilities these days.
Since there are so many different GUIs in use these days, you either need to consider all of them, some of them, or none of them.
I’ll explain what I mean by this.
If you are concerned with attracting customers from specific GUI segments (say, iPhones vs. Android phones), then you might want to treat the GUI as an additional variable, and use multivariate testing to try out combinations of new content vs. device (GUI).
The other way to handle this source of pollution is to make sure the GUI is as randomized as possible in your single-factor A/B test. Your GUI spread on either the control or test side of the A/B should look about the same, with test subjects randomly selected from all types of GUIs.
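One way to sanity-check this (the device mix below is invented for illustration) is to compute the GUI share within each leg and confirm the two spreads look about the same:

```python
from collections import Counter

def gui_mix(assignments):
    """Share of each device/browser type within one test leg."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {gui: round(count / total, 3) for gui, count in counts.items()}

# illustrative per-leg device logs
leg_a = ["desktop"] * 600 + ["mobile"] * 350 + ["tablet"] * 50
leg_b = ["desktop"] * 590 + ["mobile"] * 360 + ["tablet"] * 50

print(gui_mix(leg_a))
print(gui_mix(leg_b))
```

If one leg shows, say, 60% mobile and the other 40%, your randomization is broken and any difference you measure may just be a device effect in disguise.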
4. Paralysis by Analysis
Botched data analysis is like skimming oil off the ocean surface while the pipeline keeps leaking on the seafloor: if you don't analyze your results carefully, you waste your valuable time and pollute your A/B test.
This can happen in many different ways, so let’s explore a few of them:
Statistical Significance
This term gets thrown around quite a bit. Many people confuse significance with ironclad truth. This is not the case!
If you are testing at a significance level of .05 (a commonly chosen value), you still stand a 5% chance of concluding you have detected a difference between the A and B conversion rates when in reality there was no difference at all. The statistical term for this is a "Type I error".
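For the curious, here is a minimal sketch of how such a significance check is commonly computed – a two-proportion z-test under the normal approximation (the traffic and conversion numbers are invented):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.
    Returns (z, p_value). Normal approximation, which is reasonable
    at the large sample sizes A/B tests need anyway."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# e.g. 2.0% vs 2.5% conversion on 10,000 visitors per leg
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # "significant" at .05 means p < 0.05
```

Even when p comes in under .05, remember what that buys you: roughly one such "win" in twenty would be a Type I error if there were truly no difference.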
If we don’t succeed, we run the risk of failure.
Stopping Early
You trend your data halfway through your A/B test and 'B' is leading by a huge margin. Time to stop the test and declare a winner, right?
Wanting your new feature to be an improvement doesn’t always make it so, but it is human nature to fish for data that points in the direction you want it to.
The same applies to repeating the same experiment until you get the result you wanted. All that really means is you finally beat the odds and stumbled onto a Type 1 error, not that ‘B’ was finally proven to be an enhancement.
Margin of Error
Significance aside, it’s easy to find ourselves looking at the raw data and drawing conclusions right away. For example, we might notice that our conversion rate was 2% for the control group, and 10% for the test (B) group. Before you jump to any conclusions, don’t forget about that voice of reality known as Margin of Error (MOE).
If your MOE was ±2%, it's very likely you have hit on something. If it was ±10%, you should not be too quick to draw conclusions; it also means your sample size could stand to be a bit larger.
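A small sketch of that reality check (the counts are invented; normal approximation again): the same 2% vs. 10% gap is inconclusive at a small sample, but clear-cut at a larger one, because the intervals shrink as n grows:

```python
import math

def confidence_interval(conversions: int, n: int, z: float = 1.96):
    """95% confidence interval (low, high) for an observed
    conversion rate, via the normal approximation."""
    p = conversions / n
    moe = z * math.sqrt(p * (1 - p) / n)
    return p - moe, p + moe

# 2% vs 10% looks dramatic, but at n=100 the intervals still overlap:
print(confidence_interval(2, 100), confidence_interval(10, 100))
# at n=1000 the same rates give intervals that are clearly apart:
print(confidence_interval(20, 1000), confidence_interval(100, 1000))
```

A simple habit: only call a winner when the two intervals stop overlapping (or better, when a proper significance test agrees).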
Pollution is an appropriate term for the myriad outside factors that can negatively impact the clarity and accuracy of your A/B testing. If you want to maximize your CRO gains, you should be as vigilant about preventing test pollution as you are about recycling cans and bottles.
Just like environmental pollution, all seven sources of pollution we have discussed are 100% human-caused. The difference is that with A/B test pollution, we hold 100% of the power to either allow or prevent the polluting.
Controlling your sample size, sample bias, and setting the correct duration for your A/B testing should be just the beginning. Throughout your testing, be aware of confounding, GUI, and funneling pollution that can wreak havoc on your data.
Finally, make sure you stay disciplined when it comes to analysis. It would be a shame to ruin a crystal clear A/B test with a few drops of analysis impurity!