“That can’t be right.”
You ran a split test and picked a winner.
But fast-forward a month and your winner has turned into a loser.
Even if you’re on the right track, there are factors, both in and out of your control, that can lead you to declare the wrong variation a winner.
But don’t worry, it’s nothing to be afraid of. Just understand these 6 common problems with split tests and you’ll be good to go.
1. Your Software is Off
There are hundreds of split testing software options out there. Maybe you’ve even created your own custom software.
Either way, some of them are off — that’s a fact.
Most of them are reliable, but you should always confirm that your software is working as it should beforehand.
To do this, you need to run an A/A test.
The most basic type of split test is an A/B test, where you have two variations, A and B. With an A/A test, you simply create two versions of the same, original page, and split your traffic to both.
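Under the hood, an A/A test is just the same comparison your tool makes for an A/B test, run on two copies of the same page. Here’s a minimal Python sketch of that comparison, using a standard two-proportion z-test; the visitor count and the 5% conversion rate are made-up numbers:

```python
import math
import random

random.seed(7)  # fixed seed so the simulation is repeatable

def simulate_page(n, rate):
    """Simulate n visitors converting at the given true rate."""
    return sum(random.random() < rate for _ in range(n))

# Both "variations" are the same page: identical 5% true conversion rate.
N, TRUE_RATE = 2000, 0.05
conv_a = simulate_page(N, TRUE_RATE)
conv_b = simulate_page(N, TRUE_RATE)

# Two-proportion z-test: how surprising is the observed difference?
p_pool = (conv_a + conv_b) / (2 * N)
se = math.sqrt(p_pool * (1 - p_pool) * (2 / N))
z = (conv_a / N - conv_b / N) / se

print(f"A: {conv_a} conversions, B: {conv_b} conversions, z = {z:.2f}")
```

With identical pages, |z| should stay below the significance threshold (1.96 for 95% confidence) the vast majority of the time, so a tool that regularly reports significant A/A differences is miscounting something.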
This might seem silly at first, but you’ll sometimes get significantly different results. In other words, one of your (identical) pages will win.
If this happens, it’s typically due to a problem with your software. Either it’s calculating an important variable incorrectly, like sample size or confidence, or it’s not recording conversion events as it should, like tracking button clicks.
This is why Neil recommends always running an A/A test before your actual split tests.
2. You Didn’t Segment Your Traffic
Segmenting is the most important part of analyzing the results of a split test.
It refers to looking at the conversion rates of different sections of your traffic. For example:
- returning visitors
- new visitors
- visitors from different countries
- visitors from different traffic sources
- visitors on different days
- visitors on different devices
- visitors with different browsers
You need to identify the most likely segments that could affect the conversion rate of your traffic. Typically, starting with traffic source is a good idea.
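As a rough illustration of what segmenting means in practice, here’s a short Python sketch that breaks conversion rates out by segment; the segment names and the visitor log are entirely hypothetical:

```python
from collections import defaultdict

# Hypothetical visitor log you might export from your testing tool:
# (segment, converted) pairs.
visits = [
    ("new/mobile", True), ("new/mobile", False), ("new/mobile", False),
    ("returning/desktop", True), ("returning/desktop", True),
    ("returning/desktop", False), ("new/desktop", False),
]

totals = defaultdict(lambda: [0, 0])  # segment -> [conversions, visits]
for segment, converted in visits:
    totals[segment][0] += converted  # True counts as 1
    totals[segment][1] += 1

for segment, (conv, n) in sorted(totals.items()):
    print(f"{segment}: {conv}/{n} = {conv / n:.0%}")
```

A blended conversion rate can look healthy while one of these segments is quietly underperforming, which is exactly what the overall number hides.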
Peep Laja of ConversionXL recommends getting 250-400 conversions for each segment that you’re analyzing. But he has clarified that “magic numbers don’t exist.” You need to run a sample size calculator to know how many conversions are needed for each specific test.
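If you want to see what a sample size calculator is doing under the hood, here’s a sketch of the standard normal-approximation formula for comparing two proportions. The baseline rate, the target lift, and the hard-coded z-scores (95% confidence, 80% power) are illustrative assumptions, not a substitute for a real calculator:

```python
import math

def sample_size_per_variation(base_rate, min_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation to detect a relative
    lift of min_lift over base_rate (two-sided test).
    Only alpha in {0.05, 0.01} and power in {0.8, 0.9} are supported here."""
    z_alpha = 1.96 if alpha == 0.05 else 2.576   # 95% or 99% confidence
    z_beta = 0.84 if power == 0.8 else 1.28      # 80% or 90% power
    p1 = base_rate
    p2 = base_rate * (1 + min_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = variance * ((z_alpha + z_beta) / (p2 - p1)) ** 2
    return math.ceil(n)

# Example: 5% baseline conversion, hoping to detect a 20% relative lift.
print(sample_size_per_variation(0.05, 0.20))  # 8146 visitors per variation
```

Notice how quickly the requirement grows as the lift you want to detect shrinks, which is why "magic numbers" can’t exist.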
3. You’ve Found the 1%
When you run a split test, you continue until you reach a minimum sample size and confidence.
Ideally, you’ll run your test at a 99% confidence level. If you don’t have the traffic for that, it might be 95%, but hopefully not less.
For a test run at 99% confidence, you’ll reach the wrong conclusion about 1% of the time; it’s unavoidable. At 95% confidence, roughly 1 result in every 20 tests will be incorrect.
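You can see how quickly those error rates add up over a testing program with a one-line calculation (the test counts here are arbitrary):

```python
# Chance of at least one false "winner" across a series of split tests,
# assuming none of the variations actually differs from its control.
def false_positive_odds(confidence, num_tests):
    return 1 - confidence ** num_tests

print(f"{false_positive_odds(0.95, 20):.0%}")  # 20 tests at 95% confidence
print(f"{false_positive_odds(0.99, 20):.0%}")  # 20 tests at 99% confidence
```

At 95% confidence, that works out to roughly a 64% chance of at least one false winner somewhere in a run of 20 tests; at 99%, about 18%.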
This is completely out of your control, and one day in the future you may discover that the “winning” variation is actually performing worse. The only way to check it would be to rerun the test from scratch.
4. You Didn’t Test a Full Business Cycle
The reason you segment your traffic is that visitors in different situations behave differently.
But you might look at your visitors by day and decide that you have enough conversions to declare a winner. In some cases, that’s premature.
Whenever you run a split test, it needs to cover a full business cycle. For many businesses, running tests in full-week intervals is sufficient.
However, some businesses operate on other cycles, like monthly cycles. Buying will pick up or slow down in the first and last weeks of the month. While you might be tempted to just run tests during the middle weeks of the month, those visitors at the start and end are necessary to get a whole picture of the test effects.
Figure out how long your business cycle is, and then make sure you have enough conversions in each part of your cycle to be confident of the final split test result.
5. Your Special Day Wasn’t So Special
When you’re trying to run tests as quickly as possible, you don’t leave a lot of room for error.
A single abnormal day can affect your split test results enough to change the final outcome. The abnormal day could cause unusually high or unusually low conversions. Either way, you can’t assume that an abnormal day affects all variations equally.
So what causes abnormal days when it comes to conversions? It could be many things, but typically will be:
- the start or end of a sale
- a large mention in the press (positive or negative)
- a niche-specific event
Take Christmas, for example. It affects a huge number of businesses. But the visitors who come to your site, not just during Christmas but during the entire month of December, will probably be different from those who visit during January.
It doesn’t mean that you can’t run split tests during these months, but you may need to repeat the tests later to confirm results or gain additional insights.
If it’s only a single day that affects your results, you can either remove that day from your results, or extend your test duration (always a safe option) to drown out the noise.
6. Your Test Won, But Only For Some
Imagine this: Your original control won the test by a small margin, let’s say 2%. Naturally, you stick with it and discard your variation.
But you didn’t dig deep enough.
Like I said earlier, it all comes back to segmenting at one point or another when analyzing results. You may look through all your major segments and think everything is fine, but miss something simple like browser or device issues.
Going back to our scenario: suppose the variation actually performed 10% better on Chrome and Firefox, but 25% worse on Internet Explorer. If Internet Explorer makes up about a third of your traffic, that nets out to the variation performing roughly 2% worse overall.
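The arithmetic behind that scenario is just a traffic-weighted average. Here it is in Python, using the hypothetical numbers above and assuming each browser carries one-third of the traffic:

```python
# Relative performance of the variation vs. the control, by browser.
# The shares and lifts are the hypothetical numbers from the scenario above.
segments = {
    "Chrome":            {"share": 1 / 3, "relative_lift": +0.10},
    "Firefox":           {"share": 1 / 3, "relative_lift": +0.10},
    "Internet Explorer": {"share": 1 / 3, "relative_lift": -0.25},
}

overall_lift = sum(s["share"] * s["relative_lift"] for s in segments.values())
print(f"{overall_lift:+.1%}")  # about -1.7%, i.e. roughly 2% worse overall
```

One badly rendering segment can drag down a variation that wins everywhere else, which is why the per-browser breakdown matters.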
But clearly, it’s the better option. It’s likely you didn’t perform sufficient quality assurance checks up front. You should always see how your original and test pages look in multiple browsers (use BrowserShots), and check as many devices as you can (phones, tablets, laptops).
Say you do discover an anomaly like this. What do you do?
Since you suspect your variation should outperform the original, it’s worth taking the time to redesign your variation so it renders correctly, and then rerun your test. Obviously this takes a lot of time, which is why it’s best to double-check everything up front.
Most of the problems that arise from split testing come from an over-reliance on conversion tools.
Yes, they are powerful tools that can make your life a lot easier, but if you don’t understand how they work or what specific role they play in your testing, you’ll end up making one or more of the costly mistakes we’ve looked at in this article.
The biggest takeaway I hope you get from this post is that running a split test properly isn’t about just plugging in some ideas to a tool and analyzing the results later.
A significant amount of time should be spent verifying the software and the test setup, and identifying any potential anomalies.
After all that, then you can start digging into the results and ensuring that you have a significant amount of conversions in each segment.
Can I ask you a personal question? What software do you use to run your split tests, and what have you done to verify that you’re getting reliable, long-term results?
Read other Crazy Egg posts by Dale Cudmore