When I first started designing experiments to optimize conversion rates at Magoosh Test Prep (the company I work for), I read a bunch of case studies for inspiration, assembled a list of so-called “best practices” that I wanted to try, and naively thought that A/B testing wouldn’t involve any knowledge beyond statistics 101. It took me almost a month to realize that I was terribly wrong … about everything related to A/B testing.
Case studies are not helpful unless you truly understand the procedures and motivations behind them. There is rarely any transferable wisdom from one product to another. And A/B testing involves more than plugging a few numbers into some software and getting back a p-value.
I named this article an anti-cookbook approach to A/B testing, because I was tired of seeing statements like “big authentic photos really help with conversion”, or dubious claims like “shorten your landing page to fit above the fold and you will see a 40% improvement”. There are far too many examples like these if you search for A/B testing online, and I really didn’t want to contribute another generic article to the pool.
So, how is this article different?
I’d like to help you form your own testing strategy. A good testing strategy has to be targeted to your problem, grounded in the ecosystem of your product, ruthlessly prioritized among many reasonable options, and robustly executed at the right time for the right reasons.
First, let’s talk about bad strategies. If you find yourself in any of these three situations, you are not on the right track:
- You are proposing an experiment because someone else has done so and achieved an incredible result.
- You are running an experiment to defend a hypothesis that feels right.
- You’ve frequently found winners after running experiments that seem too good to be true.
If any of those describe the experiment(s) you’re running right now, stop them and read ahead. Hopefully, the rest of this article will help set you on a more productive track.
Before you dive right into your next experiment, I’d like you to ask yourself one important question.
Do I need to A/B test this?
There are three actions you can take once you have an idea:
- Just do it
- Try it now and evaluate it later
- A/B test it
Each option addresses different problems. These days, when everybody is talking about A/B testing, it’s crucial to take a step back. The table below explains how I see A/B testing fitting among the three actions.
Once you can confidently place your idea in a testing bucket, how do you proceed from there?
Determine whether you’ve designed a successful test
A successful test design meets four criteria:
1. You can articulate and defend the motivation behind your test
This is more difficult than it sounds. A motivation should be more than a feeling. Try this line of thinking: if your hypothesis were true, what would be the behaviors that you should probably be able to see?
For example: At Magoosh, we were trying to improve conversion rates by experimenting with our product offerings for the GRE. One idea was to highlight the 1-month study plan on our pricing page (we previously highlighted subject-based plans, e.g. GRE Math only plan, GRE Verbal, etc.). If highlighting the 1-month plan were to effectively increase conversion rates, we knew we could expect to see some of those behaviors:
- A good proportion of students only study for one month even if they have a longer plan.
- Students explicitly ask for shorter plans (for a cheaper price) through customer support.
In our case, one thing that made the hypothesis especially convincing was that we actually had a link to the 1-month plan buried in the FAQ section on our product page. It wasn’t very visible, but people were still buying it, more so than they were for some other plans, so we went ahead and highlighted the 1-month study plan on our pricing page. It proved to be a winner.
It’s crucial to thoroughly walk through the “if so, then” step after you have a hypothesis. Here’s a good blog article from RJ Metrics that essentially makes the same argument.
2. Your test can move the metric for a very large or impactful segment of the population
A feature almost always works differently for distinct segments of your customers. Whenever we have a hypothesis at Magoosh, we stop to think who might respond most favorably to the new feature we’d like to test.
For example, we are trying to improve our Net Promoter Score (NPS) for our GMAT product. One area we thought we could improve was our mock tests. The tests used the same question bank as the regular practice questions and quizzes that students completed as they progressed through our GMAT course. We got complaints from people seeing repeated questions in their mock test, so we thought: why don’t we separate out a pool of questions that would only be accessible in mock tests?
If we rolled out this new feature, it would only improve NPS for people with this profile:
- Heavy users who use a large percentage of the question bank
- Those who take a mock test after they’ve worked through many practice sessions
- Those who don’t benefit from seeing repeated questions and are currently detractors
We estimated that these users represented less than 10% of our users. Because of the limited reach, we ultimately decided it wasn’t worth pursuing the new mock test feature at that time.
3. Your test is simple
There is a reason you are A/B testing a change: you are not committed to it yet. Build it as fast as you can. Try listing out all your viable experiment options and choose the simplest one.
For example, we thought it was worth it to try improving our NPS score by providing better diagnostics to students (e.g. information about subject areas they need to work on, highlights of their strengths and weaknesses, how much work they should put into certain areas before their tests, etc.). There are many options we could adopt to test this idea. Roughly speaking, they fell into one of the four categories below (A-D):
After weighing the options, we decided to take the middle ground between A and B: Do the calculations online and in-product, but prompt students with an email once a week to monitor their progress. We settled on this decision because we wanted to scale it fast to speed up the feedback loop. Nudging them weekly provided extra encouragement to try out the feature, and the development effort was still very manageable with this option.
While option C would seem attractive on usability, it’s incredibly hard to do well and incurs high development effort. I’d need much more validation for the feature before adopting that option.
Sample Progress Report
4. You have a clear sense of what a reasonable testing result would be
At the beginning of your experiment, it’s helpful to establish a baseline of key metrics that you’ll be working on and to set a range of acceptable fluctuations for those metrics.
For example, our conversion rate is usually 5% +/- 0.15% for one page in January for one of our exam products, and it peaks on Wednesdays each week. If, all of a sudden, we were to see a 9% conversion rate in one version of an A/B test on that page, we would be worried. Chances are we wouldn’t be seeing an outstanding winner, but an experiment that had been contaminated somehow.
This has actually happened to us before: we saw an unusually high revenue number for a variant, then realized it was coming from bulk sales that severely skewed the numbers. We now tease out those outliers.
There are countless other ways that non-experimental factors can affect your experimentation. The best way to fend those off is by knowing precisely the dynamics of your metrics across time.
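As a sketch of this kind of baseline check, you can keep the historical mean and standard deviation of a metric and flag any experiment reading that falls far outside that band. All the numbers below are hypothetical, and the 3-sigma cutoff is just one reasonable choice:

```python
import statistics

# Hypothetical daily conversion rates (%) from historical data for one page.
historical_rates = [4.9, 5.1, 5.0, 4.85, 5.15, 5.05, 4.95, 5.1, 5.0, 4.9]

mean = statistics.mean(historical_rates)
stdev = statistics.stdev(historical_rates)

def is_suspicious(observed_rate, mean, stdev, n_sigmas=3):
    """Flag an observed rate that falls outside mean +/- n_sigmas * stdev."""
    return abs(observed_rate - mean) > n_sigmas * stdev

print(is_suspicious(9.0, mean, stdev))  # a 9% reading is far outside the band
print(is_suspicious(5.1, mean, stdev))  # normal fluctuation
```

A flagged reading doesn’t prove contamination; it’s a prompt to go look for bulk sales, bot traffic, or tracking bugs before trusting the result.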
Evaluating your test results
You started an experiment and let it run a few weeks. Now what?
Here’s where we get to talk about the science behind hypothesis testing. I’d like to highlight a couple of areas that I see many people overlook.
Using a classical frequentist approach
Most people take this approach. Here is a breakdown of how to do it with some tips.
Step 1: Estimate the effect size that’s meaningful to you.
Even if you don’t know what the actual effect size will be, you still need to estimate one: the sample size you need depends on it. That’s why pre-test estimation is important. It also helps to shoot for big wins, since larger effects reach significance with smaller samples.
Tips: Depending on how big your revenue/profit is, you may already have some understanding of a meaningful effect size. If you manage a product that generates 5M dollars, then improving your conversion in the product by 10% is a big deal. Anything less than 1% is probably not worth it.
Step 2: Pick an alpha value (usually 0.05) and beta value (usually 0.2) to figure out the sample size.
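The sample-size calculation in Step 2 can be sketched with the standard normal-approximation formula for comparing two proportions. The baseline and target rates below are hypothetical:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-proportion test.

    Standard normal-approximation formula:
      n = (z_{alpha/2} + z_{beta})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_target - p_base) ** 2)

# Detecting a lift from 5.0% to 5.5% conversion needs roughly 31K visitors
# per variant; shooting for a bigger effect needs far fewer.
n = sample_size_per_variant(0.05, 0.055)
print(n)
```

This is also why shooting for big wins matters: halving the target effect roughly quadruples the required sample.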
Step 3: Run your experiments.
There’s no need to pay attention to the experiment results until you reach your pre-calculated sample size (with the exception of terribly bad experiment results). Again, if you form a good sense of what the testing results should be before you run the experiment, you’ll know what’s terribly bad.
Tip: Watch out for false positives. The reason why we need to commit to a sample size before running an experiment is that it’s too easy to cherry-pick the testing result you want to see and end your experiment prematurely.
A more academic way to view this issue is that the more you are tempted to peek into the results dashboard and end the test based on the performance at that moment, the more likely you are to see a false positive, or run a higher risk of underestimating your true p-value. Many people have discussed this matter before, like Evan Miller and AirBnB on their respective blogs.
The intuition is simple: Consider the experiment of tossing three coins. It’s a somewhat unlikely outcome to see 3 heads, as the probability of this is only 0.125. However, if you do this 100 times, you’re almost guaranteed to see it happen, maybe more than once.
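That coin-toss intuition is easy to check with a quick simulation using only the standard library:

```python
import random

random.seed(0)

def saw_three_heads(n_rounds=100):
    """Toss three fair coins n_rounds times; True if any round is all heads."""
    return any(all(random.random() < 0.5 for _ in range(3)) for _ in range(n_rounds))

# A single round of three heads has probability 0.125, yet almost every
# 100-round session contains at least one (1 - 0.875**100 is essentially 1).
hits = sum(saw_three_heads() for _ in range(10_000))
print(hits / 10_000)
```

Peeking at a running experiment works the same way: each look is another chance for noise to cross your significance bar.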
Step 4: Declare the winner, or the lack of any winners
If by the end of the test, it has yielded significant results, then you have your winner. If you don’t see any significant results, you don’t have a winner. Plain and simple.
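As a minimal sketch of that final check, here is a two-sided two-proportion z-test using only the standard library; the conversion counts are hypothetical:

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 500/10,000 conversions for control, 600/10,000 for variant.
p = two_proportion_p_value(500, 10_000, 600, 10_000)
print(p < 0.05)  # significant at alpha = 0.05, so declare the winner
```

If `p` comes in above your alpha at the pre-calculated sample size, you stop there: no winner, plain and simple.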
Using sequential data and simulations
Some people don’t like the classical frequentist approach. Here are a few reasons they might dislike it, especially if they work at a startup:
- Data comes sequentially over time, not in batches. After all, who wants to be blindfolded while the test is running?
- Time is precious. If possible, some would rather end the test as soon as they have confidence that one is winning, rather than wait until they reach the predetermined sample size.
- Lots of things are uncertain. It’s very difficult to know what the expected result is for a particular variant. And you’d need this info to calculate sample size to begin with.
Sequential analysis offers a very attractive alternative because it allows you to evaluate results as they come in. With sequential analysis, you set a bar (this can be a dynamic p-value or some effect size parameter), and as soon as your experiment generates a result that surpasses this bar, you can declare a winner. Please check out this article by Evan Miller. He does an excellent job explaining the matter.
It is intuitive to use different bars at different stages of the experiment. When you first start an experiment, you don’t yet know what the true effect size will be, so you are probably declaring a lot of fake winners in the early days, only to find that they reveal their true identities later on in the experiment. As you gather more information, the chance of you making this mistake also lowers. One way to control this bias is by setting a p-value bar that is lower (stricter) than you normally would if you had patiently waited until the sample size met your pre-calculation.
Apparently, AirBnB takes the same approach: They dynamically change the p-value, making it much harder to declare a winner in the early days of experimentation.
The good news is that you don’t even need to do a lot of math to figure out your dynamic p-values. You can run some simulations.
- Imagine Versions A and B have the same true value for your metric, let’s say conversion rate. Simulate a realistic number of visitors and payments every day for the next 30 days based on your historical data and forecast.
- Run 10K such simulations, such that you will have 10K pairs of conversion rates of Versions A and B in each of the 30 days.
- Find the distribution of p-values when you run a statistical test for each of the conversion rate pairs for each of these 30 days, one day at a time. In other words, you generate 30 distributions of p-values, one for each day.
- Find the p-value thresholds in each of the 30 distributions that correspond to a false positive rate of 5%.
These thresholds, which increase as the days of the experiment go on, will be your dynamic p-values.
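The simulation recipe above can be sketched as follows. The traffic numbers are hypothetical and scaled down (10 days and 2,000 runs instead of 30 days and 10K) so it finishes quickly, and a normal approximation stands in for true binomial draws. Note that this simple version calibrates each day’s threshold in isolation; a fuller version would control the cumulative false-positive rate across all peeks:

```python
import math
import random
from statistics import NormalDist

random.seed(1)

def approx_binomial(n, p):
    """Normal approximation to Binomial(n, p) -- fine for large n in a sketch."""
    k = round(random.gauss(n * p, math.sqrt(n * p * (1 - p))))
    return max(0, min(n, k))

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A/A setting: both variants share the same true rate, so any "winner" is a
# false positive. Hypothetical traffic: 500 visitors/day/variant, 5% base rate.
DAYS, SIMS, VISITORS, RATE = 10, 2_000, 500, 0.05
daily_p_values = [[] for _ in range(DAYS)]
for _ in range(SIMS):
    conv_a = conv_b = n = 0
    for day in range(DAYS):
        n += VISITORS
        conv_a += approx_binomial(VISITORS, RATE)
        conv_b += approx_binomial(VISITORS, RATE)
        daily_p_values[day].append(z_test_p_value(conv_a, n, conv_b, n))

# Per-day threshold: the p-value a result would have to beat on that day so
# that only 5% of A/A runs would (wrongly) cross it.
thresholds = [sorted(day)[int(0.05 * SIMS)] for day in daily_p_values]
print([round(t, 3) for t in thresholds])
```

The same skeleton extends directly to the 30-day, 10K-simulation setup described above once you plug in your own traffic forecast.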
Bayesian approach to A/B testing, and good old graphs
I really struggled with whether to include a section on Bayesian experiment design in this article. I think there are a lot of advantages to using it:
- If you are not afraid of doing some integrals and algebra, Bayesian estimators are more intuitive than frequentist estimators.
- Its design allows you to deal with the “peeking” issue to some extent and is more robust to disruptions in your experiment.
Unfortunately, the cost of computation and, more importantly, the cost of communicating it such that every engineer, product manager, and marketer that you work with understands the methodology is high.
In our own experiments, we don’t currently produce Bayesian results. That’s something I would like to adopt as we grow bigger and run more experiments with more complications.
I’m not going to dive too deep into this. If you are interested, take a look at this article from Variance Explained.
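For a taste of the approach, here is a minimal sketch of the conjugate Beta-Binomial flavor of Bayesian A/B testing, with uniform Beta(1, 1) priors and hypothetical conversion counts. The output is the kind of statement stakeholders tend to find intuitive: the probability that the variant is actually better.

```python
import random

random.seed(2)

# Beta(1, 1) priors updated with hypothetical counts:
# control: 500 conversions / 10,000 visitors; variant: 540 / 10,000.
a_alpha, a_beta = 1 + 500, 1 + 9_500
b_alpha, b_beta = 1 + 540, 1 + 9_460

# Monte Carlo estimate of P(variant's true rate > control's true rate),
# by drawing from each posterior and counting how often B beats A.
draws = 20_000
wins = sum(
    random.betavariate(b_alpha, b_beta) > random.betavariate(a_alpha, a_beta)
    for _ in range(draws)
)
print(wins / draws)
```

Compare that with a p-value, which answers the less natural question of how surprising the data would be if there were no difference at all.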
What I would suggest is that you always produce a chart of percentage improvement over your control version in your key metric, and if you feel like it, you can also plot the p-values at each time point, something like the graph below.
I’ve found doing that really empowers the teams I work with. It’s very simple, and it helps people develop an empirical sense of how wild things can be at the earlier days of the experiment, and how they eventually settle down.
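Here is a sketch of how those two daily series (percentage improvement over control, and the p-value at each time point) can be generated; the traffic numbers and true rates are hypothetical, and the resulting lists can be handed to any plotting library:

```python
import math
import random
from statistics import NormalDist

random.seed(3)

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical daily traffic: 400 visitors/day/variant, true rates 5% vs 5.5%.
conv_a = conv_b = n = 0
improvement, p_values = [], []
for day in range(30):
    n += 400
    conv_a += sum(random.random() < 0.050 for _ in range(400))
    conv_b += sum(random.random() < 0.055 for _ in range(400))
    rate_a, rate_b = conv_a / n, conv_b / n
    improvement.append(100 * (rate_b - rate_a) / rate_a if rate_a else 0.0)
    p_values.append(z_test_p_value(conv_a, n, conv_b, n))

# Plot `improvement` and `p_values` against day (e.g. with matplotlib) to show
# how wildly the early days swing and how the series settle over time.
print(round(improvement[0], 1), round(improvement[-1], 1))
```

Even without any Bayesian machinery, watching these curves day by day builds exactly the empirical intuition described above.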
Preparing for the unexpected
Reality is messy. Much as we’d like to have a clean experiment, we often don’t. Here are a few things to note:
- Different business cycles may bring in different audiences. Before you declare a winner, consider whether the participants in an experiment at one time are similar to those at a different time of the year. For example, if most of your users come from organic search in one experiment, and you are ramping up your paid marketing for the next quarter, what seems appealing at the moment may no longer be the right move. Unfortunately, in this situation, all you can do is re-test.
- Sometimes your early experiment results can be especially wild. It’s like a shock effect. For example, we have a lot of return users for our GRE product. When we roll out a new design, sometimes people get upset because they are not used to it. If you look at all your users, your metric might look very bad. If you look at new users alone, it might be more comforting. When you are working on something long-term and you have a replenishing pool of new users, it’s better to look at only new users.
- Watch out for the interaction effect. Sometimes the performance of one experiment may affect that of another experiment. Bundle them up if they are reinforcing each other. Pick one if they detract from each other.
Before I conclude this article, I’d like to emphasize the importance of having a healthy attitude toward your experiments. At the end of the day, you are just experimenting. Many experiments fail and that’s okay. I strongly recommend that you build a list of different experiments that you’d like to run so you have many options to choose from. This prevents you from getting too emotional with your test.
Striking the balance between running an experiment rigorously and running it fast is something that I’m still learning to do as well. Your A/B testing environment should evolve as your company grows. Here at Magoosh, our next steps are to make sure there is less leakage across experiments, to flesh out more robust statistics, and to build a system conducive to knowledge sharing. It’s a long process, but we’re very excited about its prospects. And I’m excited to see whether these tips work for you.
About the Author: Sam is the Data Scientist at Magoosh Online Test Prep and loves examining data to help roll out new features that make test-takers’ lives easier. He has an MS in Development Practice from UC Berkeley, and a BA in Management from Renmin University of China. In his spare time, he enjoys playing tennis and singing along with his electronic keyboard.