Confidence Interval: A range of values, calculated from sample data, that is expected to contain the true value of a population parameter (such as a mean) with a stated level of confidence, for example 95%.
Much of statistics is about estimating a property of a whole population by sampling a portion of it. Since you can never be 100% certain of that estimate, the result is often expressed as a range of plausible values. This range is known as the confidence interval.
For example, you might estimate average body weight based on a random sample of 500 men and 500 women. Results vary from one sample to the next, so you need to add a measure of that variability to your estimate. When we refer to a statistic, such as average (mean) weight, being a certain value “plus or minus” another value, that plus or minus number is the margin of error, and the entire range, from the estimate minus the margin to the estimate plus the margin, is the confidence interval.
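As a rough illustration of the “plus or minus” idea, the sketch below computes a mean, its margin of error, and the resulting confidence interval. It assumes a normal approximation (reasonable for a sample of 1,000) and uses randomly generated weights as stand-in data; the function name `mean_confidence_interval` and the numbers are hypothetical, not taken from any real study.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, margin_of_error, (low, high)) using a normal approximation."""
    n = len(values)
    sample_mean = mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = stdev(values) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean, margin, (sample_mean - margin, sample_mean + margin)

# Made-up data: 1,000 simulated body weights in kilograms.
random.seed(0)
weights_kg = [random.gauss(75, 12) for _ in range(1000)]

m, moe, (low, high) = mean_confidence_interval(weights_kg)
print(f"mean = {m:.1f} kg, margin of error = {moe:.1f} kg")
print(f"95% confidence interval = ({low:.1f}, {high:.1f})")
```

For small samples, a t-distribution critical value would usually replace the normal one used here.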
A confidence interval can also be used when designing an A/B test. When you calculate the required sample size, you will sometimes be asked to select a target confidence interval, depending on the calculator or software you are using. Some calculators use the label “confidence interval” for what is, strictly speaking, the margin of error. If you want to reliably detect a small difference between A and B, you need to select a narrow confidence interval. In the sample size calculator below, a 2% confidence interval is selected.
If the means observed for A and B in your test differ by less than 2%, their confidence intervals will overlap, meaning the difference is “within the margin of error” and the test is inconclusive. As you can see, a confidence interval this narrow requires a large sample, at least when compared to the overall population size.
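To see why a narrow interval demands a large sample, here is a minimal sketch of the textbook sample size formula for a proportion, with an optional finite population correction. It assumes the 2% setting is a margin of error of plus or minus 2 percentage points and takes the worst-case proportion p = 0.5; the function name `required_sample_size` and the population of 10,000 are illustrative assumptions, and a given calculator may use a different method.

```python
import math
from statistics import NormalDist

def required_sample_size(margin, confidence=0.95, p=0.5, population=None):
    """Sample size needed to estimate a proportion within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population.
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction shrinks the requirement.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(required_sample_size(0.05))                     # wider margin, smaller sample
print(required_sample_size(0.02))                     # narrower margin, larger sample
print(required_sample_size(0.02, population=10_000))  # hypothetical population size
```

Because the margin appears squared in the denominator, halving the margin of error roughly quadruples the required sample.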
If you did not specify a confidence interval when calculating your sample size, you can still compute one after the A/B test has concluded to help you interpret the results. It can be calculated from your sample size, population size and confidence level.
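The after-the-fact calculation can be sketched as follows. This is one common approach, assuming a proportion metric such as conversion rate, a normal approximation, and the worst-case p = 0.5 when no observed rate is supplied; `margin_of_error` and the example figures are hypothetical.

```python
import math
from statistics import NormalDist

def margin_of_error(sample_size, confidence=0.95, p=0.5, population=None):
    """Margin of error for a proportion, given the sample actually collected."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    if population is not None:
        # Finite population correction: sampling a large share of the
        # population leaves less uncertainty.
        margin *= math.sqrt((population - sample_size) / (population - 1))
    return margin

print(f"{margin_of_error(1000):.2%}")                     # population treated as very large
print(f"{margin_of_error(1000, population=10_000):.2%}")  # hypothetical population of 10,000
```

The confidence interval is then the observed value minus and plus this margin.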
For example, let’s say you ran your A/B test and the results looked like the graph below.
Group “B” had a higher mean conversion rate than Group “A”, and the two confidence intervals, although wide, did not overlap. This is a good indication that B is reliably better than A, and that your results are statistically significant.
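The overlap check described here can be sketched in a few lines. The example assumes a simple normal-approximation (Wald) interval for each group's conversion rate; the visitor and conversion counts are invented for illustration, and `proportion_ci` and `intervals_overlap` are hypothetical helper names.

```python
import math
from statistics import NormalDist

def proportion_ci(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin

def intervals_overlap(a, b):
    """True if the two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up results: 5,000 visitors per group.
ci_a = proportion_ci(200, 5000)  # Group A: 4.0% conversion
ci_b = proportion_ci(280, 5000)  # Group B: 5.6% conversion

print(f"A: {ci_a[0]:.2%} to {ci_a[1]:.2%}")
print(f"B: {ci_b[0]:.2%} to {ci_b[1]:.2%}")
print("overlap" if intervals_overlap(ci_a, ci_b) else "no overlap")
```

Non-overlap is a conservative signal: two intervals can overlap slightly while the difference between the groups is still statistically significant, so a borderline result is worth checking with a direct test of the difference.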