Hopefully you had a wonderful Valentine’s Day and enjoyed your time with a “significant” other.
What makes them significant? Is it their personality, the way they make you feel, or simply the fact that your hypothesis testing revealed less than a 5% chance that your feelings for them could be random?
The latter definition would probably not launch a successful internet dating site.
We often think of “significant” as something special and irreplaceable, but statistically, it means something else altogether.
When you are conducting A/B testing or any other statistical testing for your website, understanding significance can make or break your interpretation of results, and a misunderstanding can set your conversion rate optimization (CRO) on the wrong path. The same can be said for the thousands of other statistical tests being performed every day, in every conceivable industry.
To get to the heart of significance, in the purest sense, I want to explore the concept, explain how it originated, and help you to understand it better when you interpret your website testing.
…and I’ll try to make it all as painless as watching the sunset with your sweetheart.
1. A Little History
Even though the use of statistics in general dates back many centuries, the modern concepts of significance, hypothesis testing, randomization and even Design of Experiments (DOE) can be traced back to one early twentieth-century genius: Sir Ronald Fisher (1890-1962).
Fisher was an Evolutionary Biologist and Statistician who had a passion for the study of evolution and natural selection in all types of animal and plant species. Over the course of his illustrious career, he also developed or popularized many of the common statistical tools we take for granted today.
Fisher used the tools he developed to help explain observations in biology such as dominance, mutations and genetic variance. We can use these same tools today to optimize and improve website content. The fact that these tools can now be used to analyze things that didn’t even exist when they were created seems rather amazing. Almost as amazing as performing advanced stats without a computer 100 years ago!
To describe the results of a statistical experiment as being unlikely to have arisen by chance alone, Fisher popularized use of the word “significant”. Ever since then, this meaning has stuck like pollen on a mature flower stigma.
Among Fisher’s many interesting theories of natural selection was the “Sexy Son” hypothesis (https://en.wikipedia.org/wiki/Sexy_son_hypothesis), which he presented to explain the phenomenon of women choosing promiscuous men as partners. He believed the motive of the women was that they would then have equally promiscuous sons, who would produce lots of descendants to carry on their line. I’m not sure I’m totally buying it, but I guess that’s why they call it a theory.
Like all brilliant scientists, Fisher was prone to the occasional error. At least one or two of those errors still plague the world of statistics today.
“Anyone who has never made a mistake has never tried anything new.”
2. What’s Your Hypothesis?
To understand significance, you first need to understand a bit about hypothesis testing, since the two are always intertwined.
A hypothesis is just a theory, like believing someone to be guilty in a true crime drama. Once you come up with that theory, you need to set about gathering enough evidence to prove it.
The Null Hypothesis
Here’s where it tends to get confusing. You need to keep in mind that your null hypothesis is not the thing you want to prove, like proving your enhancement increases the conversion rate on your website, but is actually the opposite of that. The null hypothesis is always the theory that nothing changed. Usually the goal is to disprove this theory, not prove it.
Going back to the true crime drama, the null hypothesis is like a defendant being presumed innocent until proven guilty. The proven guilty part is the alternative hypothesis.
So if your null hypothesis is that two things are equal, and you are trying to show some sort of effect, like B is better than A, you need to reject your null hypothesis in favor of your alternative hypothesis for your test to be considered successful.
Your alternative hypothesis can include a numeric value, such as B – A > 20%. But for simplicity’s sake, this is what your null and alternative hypotheses will look like when you are just trying to show that B has a better conversion rate than A:
Null hypothesis (H0): B = A
Alternative hypothesis (Ha): B > A
Another name for the alternative hypothesis (Ha) is the “research” hypothesis, since it is what the researcher is really interested in showing.
3. Significance And The “p” Value
That brings us back to Dr. Fisher and his idea of significance.
Now that you have your null hypothesis and alternative hypothesis in mind, how can you prove one and disprove the other?
The sad reality is that you can never know for sure.
Since statistics by its very nature involves sampling a portion of a population and making decisions based on how that sample behaves, you can never be 100% certain of your results. The huge difference between the poll numbers and the actual results in a primary election is a great example of that.
Dr. Fisher, like most of us, wanted to create a dividing line whereby you could say your experiment was successful or not. Thus, the rule that a p-value of .05 or less signals significance was born.
Don’t worry, it’s really not as confusing as it sounds. But I do consider this rule to be an error in judgment that we are still living with today.
Any type of test variable will have what is called a “distribution.” The distribution is just a representation of all the possible values the test variable could have, and their expected frequency.
For example, if you are looking into the difference in percentage between people preferring apples to people preferring oranges, the distribution could include anything from -100% to +100%, but those extreme “tails” of the distribution are much less likely than something in the middle, like 0% or 5%. That’s why most types of distributions have a bell shape, with the more likely values falling closer to the hump in the center.
Using an equation based on your sample size(s) and test results, you can calculate what is called a “test statistic” that tells you how far away from the center of your distribution your results are. This also tells you how far away you are from what the null hypothesis predicts, if it is true. This test statistic is then used to find the “p”-value from the z-table.
You can make it easier and use one of the many significance calculators available online.
The “p” stands for probability: specifically, the probability of seeing results at least as extreme as yours if the null hypothesis were true. If the number is small, that signals a potential difference between test groups, since the null hypothesis is that they are the same. Graphically, this would mean your test statistic was closer to one of the tails of your bell-shaped distribution.
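For the curious, the arithmetic behind a significance calculator can be sketched in a few lines of Python. This is only a minimal sketch, assuming a one-sided two-proportion z-test, which is one common way to compare conversion rates; any particular calculator may use a different method. The function name and the visitor and conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided test of H0: B = A against Ha: B > A (hypothetical helper)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both groups share one rate, so pool them.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    # Test statistic: how many standard errors B's rate sits above A's.
    z = (rate_b - rate_a) / std_err
    # p-value: the area of the standard normal curve to the right of z,
    # which is the same number you would look up in a z-table.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Made-up example: 200 of 5,000 visitors convert on A, 245 of 5,000 on B.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=245, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up numbers the p-value comes out below .05, so the test statistic sits out toward the right-hand tail of the bell curve.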
“If the p-value is low, the null must go.”
Dr. Fisher decided to set his own threshold of significant vs. not significant at p ≤ .05. In my opinion, there were a couple of problems with this idea:
- First, rejecting your null hypothesis does not mean proving your alternative hypothesis. This artificial dividing line is what created this common misconception. All “significance” really means is that your results would be unlikely if A and B were truly equal.
- Second, a p-value of .049, which would count as “significant”, means that results like yours would still turn up about 4.9% of the time even if the null hypothesis were true! That means your test results could be “significant” and wrong at the same time.
If it were up to me, there would be no dividing line at p=.05 (or sometimes p=.01 in other fields). Instead, we would look at our p-value on a case by case basis, and then decide if that probability is low enough for us to accept and implement our test results.
The most common way to perform a statistical test today is to set your significance threshold at p ≤ .05 before you even run the test. Just remember to look closely at the p-value when you review results, since it might have been very close to making it over the significance threshold (just over) and could be a strong signal in the right direction. It could also be so close to .05 (just under) that there is still a pretty good chance the result is incorrect.
In other words, significance is indeed something, but it is not everything.
4. Type 1 And 2 Errors
Over the years, the errors that stem from the use of the significance barrier have even been given names to describe them.
Type 1 Errors
As I mentioned earlier, a significance threshold of .05 means that you still stand a 5% chance of rejecting the null hypothesis when it is actually true. When that happens, it is called a Type 1 error. Your results might tell you that your new website enhancement improved conversion, and that the improvement was significant, but an enhancement that does nothing at all will still come up “significant” about 5% of the time (1 in 20).
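One way to see this 1-in-20 risk is to simulate it. The sketch below is an illustration under assumptions, not a standard recipe: it runs many “A/A” tests, in which both versions share the same true conversion rate, and counts how often a one-sided two-proportion z-test still declares a winner at p ≤ .05. The function names, the 10% conversion rate, the visitor counts and the random seed are all made up.

```python
import random
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """p-value for H0: B = A against Ha: B > A (pooled two-proportion z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:  # no variation at all, so no evidence against the null
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 1 - NormalDist().cdf(z)


def false_positive_rate(true_rate=0.10, visitors=500, n_tests=1000,
                        alpha=0.05, seed=1):
    """Run many A/A tests (B is identical to A) and count the 'significant' ones."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Each visitor converts with the same probability in both groups.
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        if one_sided_p_value(conv_a, visitors, conv_b, visitors) <= alpha:
            false_positives += 1  # a Type 1 error: nothing actually changed
    return false_positives / n_tests


rate = false_positive_rate()
print(f"Share of A/A tests declared significant: {rate:.1%}")
```

The share it prints should land somewhere near 5%, and it will wobble from run to run if you change the seed.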
Type 2 Errors
A Type 2 error is the opposite of a Type 1: failing to reject the null hypothesis when it is false. This is where your test results tell you that your enhancement did nothing to improve conversion, but it really did! Talk about your missed opportunities.
This kind of error is common in tests with inadequate sample size, since sample size improves the power of your test. Power is a pretty complicated concept, but the main takeaway is that a bigger sample size helps prevent this type of error.
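To put rough numbers on that takeaway: power is the probability that your test detects an effect that really exists. The sketch below uses a textbook normal approximation for a one-sided two-proportion z-test to estimate power at a few sample sizes. Treat it as an illustration only. The function name, the 4.0% and 4.9% conversion rates and the sample sizes are assumptions made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(rate_a, rate_b, visitors_per_group, alpha=0.05):
    """Approximate power of a one-sided two-proportion z-test of Ha: B > A."""
    n = visitors_per_group
    pooled = (rate_a + rate_b) / 2
    # Spread of the observed difference if the null hypothesis were true...
    se_null = sqrt(2 * pooled * (1 - pooled) / n)
    # ...and its spread given the (assumed) true rates.
    se_alt = sqrt(rate_a * (1 - rate_a) / n + rate_b * (1 - rate_b) / n)
    # Cutoff that the test statistic must exceed to be called significant.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Probability that the observed difference clears that cutoff.
    return NormalDist().cdf((rate_b - rate_a - z_crit * se_null) / se_alt)


# Assumed true rates: A converts at 4.0%, B at 4.9%.
powers = {n: approximate_power(0.040, 0.049, n) for n in (1000, 5000, 20000)}
for n, power in powers.items():
    print(f"{n:>6} visitors per group: power is about {power:.0%}")
```

With these assumed rates, the smallest test would catch the real improvement less than half the time, while the largest would almost never miss it.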
5. Speed Dating With Computers
In this era of instant gratification, there are plenty of significance calculators available that let you cut to the chase.
Like a 5 minute speed date, you can easily find out if your new CRO brainchild was “The One” or just another “Mr. or Ms. Wrong”.
Don’t get me wrong. Tools like this are extremely useful, fast and convenient to use, but always keep in mind that there is a little bit more behind the numbers than meets the eye.
Perhaps no other statistical term holds the same power as the word “significant.” As I have witnessed firsthand, when the verdict is “not significant”, the aftermath can range anywhere from a new and improved experiment to a company shutdown.
Given the importance the concept of significance holds in the world of website testing, we owe it to ourselves to understand what it truly means. With an endless variety of test conditions, sample population sizes and success criteria feeding into our testing, knowing how we got to the answer, not just the answer itself, is essential.
Being an Evolutionary Biologist, Dr. Ronald Fisher must have grinned as he coined the “significance means p-value is less than .05” rule, knowing that we primitive humans would continue to follow his lead for a century or more.
If this historic decision had never taken place, we might not be spending our time talking about significance in website A/B testing, or any other testing, today.
But I like to believe we would still be talking about significance when describing a hand-written poem, a scenic moonlit drive or a classic song.
True Significance: special, unforgettable, perfect.