AB testing 101
Martin Betz • September 18, 2020
AB testing is easy to understand and hard to master. The concept is simple: form a hypothesis about something in your system and test your current implementation against a treatment. But you can easily get lost in a sea of definitions, poorly set up tests, and results that are hard to interpret. Such tests will not give you a clear winner, only numbers you cannot make sense of. Let's walk through an example together, and I will tell you the bare minimum you need to know about AB testing as a product or tech person. If you are a data scientist, you probably will not learn anything new.
Contacts: Example with all definitions
Take this example of two contact cards. You have a contact card with one call-to-action "Contact". We use a generic avatar in the original version. We call the original "control". Our hypothesis is that we can increase the rate of people clicking on the "Contact" button - often called conversion - by replacing the generic picture with a real profile picture. The version with the real image is called "treatment".
We now split the traffic on our website into two groups: say, 70% see the control and 30% see the treatment. If a higher share of visitors clicks the "Contact" button on the treatment, we choose it and replace the control implementation. We call this better conversion.
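To make the 70/30 split concrete, here is a minimal sketch of how a visitor could be assigned to a group. It assumes every visitor has a stable ID (for example from a cookie); the function name `assign_variant` and the experiment name `contact_card_photo` are made up for this example. Testing tools do this for you, and they may do it differently.

```python
import hashlib


def assign_variant(user_id: str,
                   experiment: str = "contact_card_photo",
                   control_share: float = 0.70) -> str:
    """Deterministically map a visitor to 'control' or 'treatment'.

    Hashing the (hypothetical) experiment name together with the visitor ID
    means the same visitor always sees the same version.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex characters into a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "control" if bucket < control_share else "treatment"


if __name__ == "__main__":
    counts = {"control": 0, "treatment": 0}
    for i in range(100_000):
        counts[assign_variant(f"visitor-{i}")] += 1
    print(counts)  # roughly 70% control, 30% treatment
```

The important property is that the assignment is random with respect to who the visitor is, but stable for each visitor across page loads.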
How sure we can be that our result did not just happen by chance is called significance. It depends on the sample size (the visitors on control and on treatment are each a sample) and on how similar the two groups are (the more similar, the better). The more people in each sample, say 70,000 in the control group and 30,000 in the treatment group, the better. The second factor is sample variation: both samples should contain the same mix of people. For example, it would be bad if everyone in the group of 70,000 was unfamiliar with technology and preferred offline contacts, while everyone in the group of 30,000 was a developer who likes to do everything online. In that case you could not tell whether a difference in clicks came from the profile picture or from the audience. So, in summary, larger samples and more similar samples increase the validity of your result.
Significance is usually expressed as a percentage, such as 95%. That is the conventional threshold for treating a result as unlikely to be chance; it does not make the result certain.
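If you want to see where such a number can come from, here is a minimal sketch of a two-proportion z-test, one common way to check whether two conversion rates differ by more than chance. It is not necessarily what your testing tool computes. The conversion counts in the example are invented; only the group sizes of 70,000 and 30,000 come from the text above. A p-value below 0.05 corresponds to the 95% threshold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Return (z, p_value) for the difference in conversion rates.

    conv_c / n_c: conversions and visitors in the control group
    conv_t / n_t: conversions and visitors in the treatment group
    """
    rate_c = conv_c / n_c
    rate_t = conv_t / n_t
    # Pooled rate under the assumption that both versions convert equally.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (rate_t - rate_c) / std_err
    # Two-sided p-value: chance of a gap at least this large if there
    # were no real difference between control and treatment.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 5.0% conversion on control, 5.5% on treatment.
    z, p = two_proportion_z_test(conv_c=3_500, n_c=70_000,
                                 conv_t=1_650, n_t=30_000)
    print(f"z = {z:.2f}, p-value = {p:.4f}, significant at 95%: {p < 0.05}")
```

Note that the test only accounts for sample size. It cannot detect that the two groups contain different kinds of people, which is why a random split matters.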
So what do you need to set up the test, and what should you take care of?
What you need for an AB test…
- A big sample size (often several thousand visitors per sample, and more if the expected effect is small)
- A clear goal (e.g. clicks on the contact button, measured by one Google Analytics event)
- A hypothesis ("Real image will increase conversions")
- An improvement goal ("We will accept the treatment as the new default if conversion increases by 5% or more; otherwise we will reject it because there are follow-up costs" or "We will accept the treatment as the new default if there is any increase in conversions")
- A second control sample group (exactly the same as the control; this is only needed if you are unsure about your sample variation or setup)
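How big is "big"? That depends on your current conversion rate and on the improvement you want to detect. The following is a rough sketch using a standard approximation for comparing two proportions. It assumes equal group sizes (an uneven split such as 70/30 needs more visitors in total), a 95% significance threshold, and 80% power (the chance of detecting the improvement if it really exists). I also read the "5+%" improvement goal as a relative increase; all of these are my assumptions for the example.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_group(baseline_rate: float,
                       relative_lift: float,
                       alpha: float = 0.05,
                       power: float = 0.80) -> int:
    """Approximate visitors needed in EACH group (equal split assumed).

    baseline_rate: current conversion rate of the control, e.g. 0.05
    relative_lift: smallest improvement worth detecting, e.g. 0.05 for +5%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical baseline of 5% conversion.
    for lift in (0.05, 0.10, 0.20):
        print(f"+{lift:.0%} relative lift:", visitors_per_group(0.05, lift))
```

The smaller the improvement you want to detect, the more visitors you need, and the growth is steep: halving the lift roughly quadruples the required sample.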
How to set up your test
You don't need to know a lot about statistics. Tools like Google Optimize will do all the heavy lifting with the numbers for you.
You only need sample groups (defined e.g. as "All visitors on your page"), a split (70% on control, 30% on treatment), and a clear goal (e.g. "Clicks on button", which triggers the Google Analytics event button_click_contact, or a goal that you defined in Google Analytics). Google Optimize will calculate the expected value of the goal (in our case the conversion rate) and the probability that the treatment will beat the original.
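To give a feeling for what a "probability that the treatment will beat the original" can mean, here is a small simulation sketch. It uses a simple Bayesian model (a Beta distribution with a uniform prior for each conversion rate) and counts how often the simulated treatment rate is higher. I am not claiming this is the exact calculation any particular tool performs; the function name and the conversion counts are made up for illustration.

```python
import random


def probability_treatment_beats_control(conv_c: int, n_c: int,
                                        conv_t: int, n_t: int,
                                        draws: int = 20_000,
                                        seed: int = 42) -> float:
    """Estimate P(treatment rate > control rate) by simulation.

    Each conversion rate is modelled as Beta(1 + conversions,
    1 + non-conversions), i.e. a uniform prior updated with the data.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    wins = 0
    for _ in range(draws):
        rate_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        rate_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        if rate_t > rate_c:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 5.0% conversion on control, 5.5% on treatment.
    prob = probability_treatment_beats_control(3_500, 70_000, 1_650, 30_000)
    print(f"Probability that treatment beats control: {prob:.1%}")
```

With little data the probability stays close to 50%, and it moves towards 0% or 100% as evidence accumulates.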
This is an example result from a test where the treatment beat the original:
FAQ
- Can I run multiple treatments against one control?
  - You can, if your sample sizes are big enough
  - You may not want to, because it will take longer and the numbers get harder to read
- Can I change the control or the treatment during the test?
  - No. Stop the test instead and start a fresh one
- Are 2 weeks enough for a test to run?
  - No, there is no fixed duration. What matters is reaching a significance of 95% or more
- Are 2000 visitors enough to test a hypothesis?
  - No, there is no fixed number of visitors. What matters is reaching a significance of 95% or more
- Can we test multiple changes at a time?
  - Yes, but analyzing the numbers gets harder
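One reason the numbers get harder to read with several treatments: every extra comparison against the control is another chance for a fluke to look like a winner. A simple and conservative remedy is the Bonferroni correction, which divides the allowed error rate by the number of comparisons. This is a sketch of that idea only; the treatment names and p-values below are invented.

```python
from typing import Dict, List


def significant_treatments(p_values: Dict[str, float],
                           alpha: float = 0.05) -> List[str]:
    """Return the treatments that stay significant after Bonferroni correction.

    p_values maps a (hypothetical) treatment name to the p-value of its
    comparison against the control.
    """
    # With 3 treatments and alpha = 0.05, each one must reach p < 0.0167.
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < threshold]


if __name__ == "__main__":
    results = {"real_photo": 0.004, "bigger_button": 0.03, "new_copy": 0.20}
    print(significant_treatments(results))
```

Here "bigger_button" would have passed the usual 95% threshold on its own, but it no longer counts as a winner once three treatments are tested side by side.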