So you’re running A/B tests like a good marketer should. But when are you declaring a test “done”?

Is it when you reach 100 conversions per variation? No.

Is it when you hit 95% statistical significance? No.

Is it whenever the testing tool tells you? No.

**All of these are very common misconceptions**. Let’s tackle them one by one, and figure out what the truth is.

## Forget magic numbers, calculate the needed sample size

The first step toward becoming a smart optimizer is to stop believing in magic and miracles. So all the advice about running the test until “100 conversions” or “200 conversions per variation” is as true as unicorns. This is math, not magic.

There is no fixed number of conversions you need to reach. Instead, you calculate the needed sample size (number of visitors who are part of the experiment) BEFORE you even start the test.

You can use one of the many sample size calculators online, like this or this. Here’s how you would do it with Evan Miller’s tool:

Now the tool tells you how many unique visitors per variation you need before you have enough sample size to base your conclusions on.

The higher the expected uplift (minimum detectable effect), the smaller the sample size you need. What if you powered the test to detect at least a 20% uplift, but the observed uplift is only 8% when you reach the pre-calculated sample size? That means you don’t actually have enough evidence to call it one way or the other.
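If you’re curious what those calculators do under the hood, here’s a minimal sketch of the standard two-proportion sample-size formula in Python. Different tools use slightly different models (one- vs. two-sided tests, power defaults), so the result won’t exactly match Evan Miller’s or Optimizely’s numbers:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(base_rate, mde_relative, alpha=0.05, power=0.8):
    """Visitors needed per variation for a two-sided two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + mde_relative)   # the rate we hope to detect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 3% baseline, hoping to detect at least a 20% relative lift
print(sample_size_per_variation(0.03, 0.20))  # ~13,900 visitors per variation
```

Notice how quickly the requirement grows as the minimum detectable effect shrinks – halving the expected lift roughly quadruples the needed sample.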

## No magic numbers, but what about rough ballparks?

If you want rough ballparks – then yes, you could throw around numbers like needing *at least* 350 conversions per variation. Less than that and there’s a high chance your sample size is too small. Which in turn means your A/B test result is meaningless – you don’t actually know which variation won.

Small websites that have less than 1,000 transactions (purchases, signups, leads etc.) per month might not even be ready for A/B testing yet. You could perhaps run one A/B test per month provided that the uplift is big enough to tell the difference.

Let’s say we have a website that does 450 transactions per month, and the conversion rate is 3%. That means we get 15,000 visits per month. We want to run an A/B test, and we’re optimistic about our test hypotheses – expecting a 20% lift. Here’s the kind of sample size we’d need if we’re using Optimizely:

In order to run an A/B test, we’d need to get 10,170 x 2 = 20,340 visitors that month. Therefore, we couldn’t run that test in one month, because we don’t have enough traffic. Running a test longer than a month might come with so much sample pollution that we wouldn’t be able to trust the results.
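To make that traffic math concrete, here’s a quick feasibility check in Python. The 10,170 figure is the per-variation requirement Optimizely reported in the example above; substitute whatever your own calculator gives you:

```python
monthly_transactions = 450
conversion_rate = 0.03
monthly_visits = monthly_transactions / conversion_rate   # 15,000 visits/month

needed_per_variation = 10_170            # from your sample size calculator
needed_total = needed_per_variation * 2  # A + B

months_needed = needed_total / monthly_visits
print(f"{needed_total} visitors needed, ~{months_needed:.1f} months of traffic")
```

If `months_needed` comes out above 1, you either need a bigger minimum detectable effect or more traffic before that test is feasible.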

So that means you’d need to go for a bigger win than 20% – but of course, how big you win isn’t really up to you. Nobody knows what’s going to work or how well. (Of course, your chances of a big win go up dramatically once you start coming up with data-driven test hypotheses.)

Another thing – if you want to look at the A/B test results across segments (e.g. how the test did across different devices, browsers or traffic sources), then you need enough sample size PER SEGMENT before you can even look at the outcome.

## You need a representative sample

Just calculating the needed sample size is not enough. Not only do you need to make sure that your test sample has enough people, *but that they are representative of all of your traffic*.

That means that your sample has to include every day of the week (people behave differently on Monday mornings and Friday afternoons), every traffic source, your newsletter / blog publishing schedule and other things that might affect the behavior of your target audience (e.g. payday schedule).

The rule of thumb is that you should run the test for at least 2 business cycles. That’s 2 to 4 weeks for most businesses.

## Only now look at the statistical significance

First of all – what do you think statistical significance (p-value) shows?

It doesn’t tell us the probability that B is better than A. Nor does it tell us the probability that we’d be making a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false.

Remember: the p-value is just the probability of seeing a result at least as extreme as the one observed, given that the null hypothesis is true. If you feel like learning more about p-values, this post is a good rabbit hole to go down.

Until you have enough sample size and enough representativeness in your sample, the statistical significance percentage is a meaningless number for our purposes. Ignore it until the two previous conditions have been met.

Oh, and you can’t let the testing tools do the thinking for you. Everyone who’s run enough tests has seen a tool “declare a winner” after 1 hour or 1 day. That’s complete nonsense. Statistical significance does not equal validity.
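You can see why those early “winner” declarations are nonsense with a quick A/A simulation: both variations get the identical conversion rate, yet if you peek at the p-value every few hundred visitors and stop at the first “significant” result, you call a winner far more often than the 5% that a 0.05 threshold promises. A rough sketch (the peek schedule and rates are illustrative):

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided p-value of a pooled two-proportion z-test (n visitors per arm)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = abs(conv_a / n - conv_b / n) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(42)
rate = 0.03              # SAME rate in both arms: any "winner" is a false positive
simulations = 200
false_positives = 0
for _ in range(simulations):
    conv_a = conv_b = 0
    for n in range(1, 5001):
        conv_a += random.random() < rate
        conv_b += random.random() < rate
        if n % 250 == 0 and p_value(conv_a, conv_b, n) < 0.05:
            false_positives += 1  # a peeking tool would have "declared a winner"
            break

# Prints well above the nominal 5% error rate
print(f"False positive rate with peeking: {false_positives / simulations:.0%}")
```

This is exactly what a tool does when it “declares a winner” after a day: it is one of those premature peeks.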

## Pay attention to error margins

Your testing tool will tell you the conversion rate of each variation – but bear in mind that those are not precise numbers. Each figure is a point estimate sitting inside a range of possibilities, and the range gets narrower as time goes on and your sample size increases.

It’s not that A converts at precisely 14.6% and B at precisely 16.4% – the tools will also tell you the margin of error. If the error margin is ±2%, then A could convert anywhere from 12.6% to 16.6%, and B anywhere from 14.4% to 18.4%. So A might really be at 16.6% and B at 14.4% – the opposite of what the headline numbers suggest.

So each reported conversion rate is really a range: the observed rate plus or minus the margin of error.
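Those ranges are just confidence intervals, and you can compute them yourself. Here’s a minimal sketch using the normal approximation – the conversion counts and sample sizes below are illustrative, not from any specific tool:

```python
from math import sqrt
from statistics import NormalDist

def conversion_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin

# A at 14.6%, B at 16.4%, but only 1,000 visitors per variation
a_low, a_high = conversion_interval(146, 1_000)
b_low, b_high = conversion_interval(164, 1_000)
overlap = a_high > b_low   # ranges overlap -> don't call a winner yet
print(f"A: {a_low:.1%}-{a_high:.1%}, B: {b_low:.1%}-{b_high:.1%}, overlap: {overlap}")
```

With 1,000 visitors per variation the two ranges overlap heavily; run the same numbers at 10,000 visitors each and the intervals separate.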

Here’s an example where there’s no overlap:

And here’s a fresh test (not yet done) with a large error margin:

Notice how the difference interval is between -9% and +12%.

Ideally the two variations have very little to no overlap in their range – so you can be more confident in picking the right winner (and minimizing the chance of a false positive or false negative).

This is one of the reasons why people don’t see the promised lift – the reported lift is itself a range, and people focus on the midpoint while ignoring the +/- around it.

## Conclusion

Most marketers are terrible at A/B testing statistics. Don’t be one of the ignorant fools. Stopping A/B tests too early is by far the most common mistake rookies make. If you need to polish up your stats skills, here’s a crash course.

So when do you stop a test? After these 4 conditions have been met:

- Always calculate the needed sample size ahead of time, and make sure you have at least that many people in your experiment
- Make sure you have enough representativeness in your sample, run it full weeks at a time, at least 2 business cycles
- No or minimal overlap in difference intervals
- Only look at statistical significance (95% or higher) once the first two conditions have been met
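The whole checklist can be sketched as a single function. This is an illustrative summary of the four conditions above, not any tool’s actual API – the names and defaults are made up for the sketch:

```python
def ready_to_stop(visitors_per_variation, needed_per_variation,
                  days_running, intervals_overlap, p_value,
                  min_days=14, alpha=0.05):
    """True only when all four stopping conditions hold."""
    enough_sample = visitors_per_variation >= needed_per_variation  # pre-calculated
    representative = days_running >= min_days and days_running % 7 == 0  # full weeks
    return (enough_sample and representative
            and not intervals_overlap and p_value <= alpha)

print(ready_to_stop(13_914, 13_914, days_running=28,
                    intervals_overlap=False, p_value=0.03))  # True
```

Note the order of operations: sample size and representativeness gate everything else, so a tiny p-value on day two changes nothing.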

If you want to start winning more tests, and get more impact per successful experiment, you need to study this conversion optimization guide.

**About the Author:** Peep Laja is the founder of ConversionXL and one of the leading conversion optimization experts in the world. He helps companies grow via his conversion optimization agency, ConversionXL.agency.

## One Comment


“Most marketers are terrible at A/B testing statistics.” I have to agree with that statement, although the situation is slowly improving. Calculators like the one featured above, which don’t take statistical power into account, don’t really help, though.

I’d love to see people move away from fixed-sample-size tests like the ones proposed by Peep, and toward proper sequential designs. They are more efficient, and they provide proper stopping rules for efficacy and futility – e.g., the AGILE A/B testing method. Bayesian approaches have entered the equation as well, but I don’t think the average marketer or UX expert has any clue what they are based on, which is a problem when the time comes to interpret the results they provide…