Getting valid results from your A/B test

A/B testing is a great way to let actual users respond in real time to different versions of copy, button colors and shapes, CTAs, etc., and to find out which version is the most persuasive. And it seems so easy that even the executives would understand the results. What could be simpler? More people clicked the red button; therefore, red buttons are better for conversions.

Well, the road to UX hell is paved with quick A/B test results. Before you get too confident, let’s take a look at some of the factors that you need to consider before declaring a winner.

Statistical Significance

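Statistical significance is all about whether the difference between two numbers is meaningful or just a fluke. 95% significance means that if ‘B’ were really no better than ‘A’, you would see a difference this large less than 5% of the time. People often read this as being 95% sure that ‘B’ is better than ‘A’, but that is only a loose shorthand.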

So statistical significance tells us that it’s better, but it doesn’t say anything about how much better.
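To make the idea concrete, here is a minimal sketch of one common way to check significance, a pooled two-proportion z-test. The visitor and conversion counts are hypothetical, and your testing tool may well use a different method, so treat this as an illustration rather than a reference implementation.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that A and B really convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: A converts 900 of 10,000 visitors, B converts 1,030 of 10,000.
z, p = two_proportion_z_test(900, 10_000, 1030, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
print("Significant at the 95% level" if p < 0.05 else "Not significant at the 95% level")
```

A p-value below 0.05 corresponds to the 95% level discussed above. Notice that the result says nothing about the size of the improvement, only that it is unlikely to be a fluke.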

That’s where confidence intervals and margin of error come in. You need to wait until your confidence intervals are narrow enough to show an improvement.

For example:

If 10.3% is the improved conversion rate, then this value is the mean.

If there is a ±1.0% margin of error, then this gives us a confidence interval spanning from 9.3% to 11.3%.

If the confidence intervals of your original page (A) and your variation (B) overlap, then you need to keep testing (even if your testing tool is saying that one is a statistically significant winner).
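The overlap check above can be sketched in a few lines. This version assumes the simple normal-approximation interval at a 95% confidence level (z = 1.96) and uses hypothetical counts; testing tools may calculate their intervals differently.

```python
import math

def conversion_interval(conversions, visitors, z=1.96):
    """Return (rate, margin_of_error, low, high) using the normal approximation."""
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate, margin, rate - margin, rate + margin

def intervals_overlap(a, b):
    """True if the (low, high) ranges of two intervals share any values."""
    return a[2] <= b[3] and b[2] <= a[3]

# Hypothetical counts: A converts 900 of 10,000 visitors, B converts 1,030 of 10,000.
a = conversion_interval(900, 10_000)
b = conversion_interval(1030, 10_000)

for name, (rate, margin, low, high) in (("A", a), ("B", b)):
    print(f"{name}: {rate:.1%} ± {margin:.1%} ({low:.1%} to {high:.1%})")
print("Keep testing" if intervals_overlap(a, b) else "Intervals do not overlap")
```

With the same conversion rates but only 1,000 visitors per version, the margins widen and the intervals overlap, which is exactly the situation where you need to keep testing.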

Avoid Overstated Data Claims

Many CROs go wrong right around here in the way they present test data to stakeholders/clients. You should avoid overstating or over-promising based on any test results. You need to remember that what your test shows is how a sample of your total population has reacted to the variation. In this example, the improved conversion rate of 10.3% is only for the sample of visitors who came to the site during the test period. You cannot assume that it will translate directly to your total population.

We have ample evidence to the contrary. There are many factors that affect conversions, including environmental changes that have absolutely nothing to do with your site. For example, the day you launch the redesigned site, the client’s competitors may release a whiz-bang new product that crushes the market and tanks their conversion rate.

So you can say that FOR THE TEST the conversion rate improved to 10.3%, and that you believe implementing the variation should bring an improvement. But you can’t say that by implementing the variation you are 95% sure their actual conversion rate will increase to 10.3%. That’s the mistake of a novice CRO. Don’t go there.

Variance and Errors

Thanks to variance, there are a number of things that can happen when we run A/B tests. (Variance is the natural variation between samples: the rate you measure in any one sample will differ somewhat from the true rate of your total population.) If you don’t understand this, you can be tempted to declare a winner too quickly.

Test says B is better & B is actually better
Test says B is better & B is not actually better (type I error)
Test says B is not better & B is actually better (type II error)
Test says B is not better & B is not actually better

A type I error is a false positive. It becomes much more likely when you call a test the moment it first shows significance instead of waiting for your planned sample size.

A type II error is a false negative. It is caused by not having enough statistical power (not waiting until you have tested against a large enough sample size).
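To see why calling a test early inflates type I errors, here is a small simulation sketch. It runs many A/A tests (both versions share the same true conversion rate, so every "winner" is a false positive) and compares stopping at the first significant look against checking only once at the planned sample size. The rate, sample size, number of looks, and seed are all illustrative assumptions.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    if p_pool in (0, 1):
        return 1.0  # no variation observed, so no evidence of a difference
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_aa_tests(runs=500, visitors=1000, looks=10, rate=0.10, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, false positive rate without)."""
    rng = random.Random(seed)
    step = visitors // looks
    peeking_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            # Add one batch of visitors to each arm; both arms share the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            significant = p_value(conv_a, n, conv_b, n) < alpha
            significant_at_any_look = significant_at_any_look or significant
        peeking_hits += significant_at_any_look
        final_hits += significant  # result at the planned sample size only
    return peeking_hits / runs, final_hits / runs

peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False positives when stopping at the first significant look: {peeking_rate:.1%}")
print(f"False positives when checking only at the planned sample size: {fixed_rate:.1%}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the fixed rate, and in practice it tends to be noticeably higher.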

When it comes to A/B testing, patience is definitely a virtue.

6 Tips for Getting Valid Results

Don’t Call Tests Too Early

Have a big enough sample size
Run for at least two full business cycles
Calculate your sample size and run rate before you test
Understand significance, confidence level, margin of error
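As a rough sketch of the sample size and run rate calculation, the snippet below uses a standard normal-approximation formula for comparing two proportions. The baseline rate, expected rate, 80% power, and daily traffic are assumptions chosen for illustration, and dedicated calculators may give slightly different numbers.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, expected, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect baseline -> expected."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # statistical power requirement
    p_bar = (baseline + expected) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(baseline * (1 - baseline) + expected * (1 - expected))
    ) ** 2
    return math.ceil(numerator / (expected - baseline) ** 2)

def days_to_run(n_per_variation, daily_visitors, variations=2):
    """Days needed to reach the sample size when traffic is split across variations."""
    return math.ceil(n_per_variation * variations / daily_visitors)

# Hypothetical inputs: 9.0% baseline, hoping for 10.3%, 1,000 test visitors per day.
n = sample_size_per_variation(0.09, 0.103)
print(f"Visitors needed per variation: {n}")
print(f"Days to run at 1,000 visitors per day: {days_to_run(n, 1000)}")
```

If the number of days comes out shorter than two full business cycles, keep the test running for the two cycles anyway.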

Don’t Run Tests for Too Long

Calculate your sample size and run rate, and then trust your calculations. Call the test when you have valid results.
Don’t get attached to your new design. Many times people keep a test running after getting valid results because they are trying to get data that will prove the new design really is better.
If the test hasn’t reached significance but you’ve run it for two cycles and hit your sample size, don’t leave it running. If you change your protocols by extending the test for months until it finally tells you something, you can’t trust it.
Remember, the longer you run your test, the more data pollution you will get.

Don’t Test for the Sake of Testing

Some CROs feel duty-bound to have a test running at all times so they can look like they’re doing their job. This can lead to sloppy testing, hastily conceived tests, and silly outlier tests.

Always test data-driven hypotheses. Otherwise you won’t learn anything, and you won’t know where to go from there.

Don’t Overlap Tests

You are after clean, honest results. Avoid tests that could give you results that are ambiguous, hard to trust, and impossible to defend.

Don’t compromise your results to increase your velocity
Make your tests mutually exclusive, so each visitor is in only one test at a time
Or run your tests on different segments (user type, device, channel, country)
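One possible way to keep tests mutually exclusive is to hash each visitor into exactly one experiment before picking a variation. This is only a sketch: the experiment names, the salt string, and the hashing approach are assumptions for illustration, not how any particular testing tool works.

```python
import hashlib

def bucket(user_id, salt, buckets):
    """Deterministically map a user to one of `buckets` slots."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def assign(user_id, experiments):
    """Place a user in exactly one experiment, then in variation A or B of it."""
    experiment = experiments[bucket(user_id, "experiment-layer", len(experiments))]
    # Salting with the experiment name keeps the A/B split independent per experiment.
    variation = "AB"[bucket(user_id, experiment, 2)]
    return experiment, variation

# Hypothetical experiments that must not overlap.
experiments = ["checkout-button-color", "homepage-headline"]
for user_id in ("user-1", "user-2", "user-3"):
    print(user_id, assign(user_id, experiments))
```

Because the assignment depends only on the user ID, a returning visitor always lands in the same experiment and the same variation. The trade-off is that each experiment only gets a share of your traffic, so it takes longer to reach its sample size.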

Validate Your Data with More Data

Run A/A tests. This is where you set up an A/B test in your test tool, but both versions are identical. This is usually done when you are evaluating a new testing tool, but it can also help you set the baseline conversion rate for your site.
Check your test results in Analytics
Segment your data, but only draw conclusions from segments that have reached statistical significance on their own.

Check for Data Pollution

Make it a part of your process to always question the data integrity against pollution factors such as:

External factors – Holidays, press releases, promotions, etc. can make your visitors behave in atypical ways, which aren’t predictive of their normal patterns.
Tool/tag factors – implementation issues, broken code, flicker effect. (Flickering is when the original page is briefly displayed before the alternative appears during testing. This happens because of the time it takes the browser to process the modifications. It can influence visitors’ behavior if they notice the switch and realize they’re in a test.)
Selection effects – if you run a test on new visitors only, don’t assume it will work with returning visitors

Bottom Line

Don’t forget, testing is always done with a sample, and thus you should never assume perfection. Even at a 95% confidence level, errors can occur. At that confidence level, about 1 in every 20 tests of a change that makes no real difference will still produce a false positive.
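The arithmetic behind that 1-in-20 figure adds up quickly across a testing program. The short sketch below assumes the tests are independent and that none of the tested changes makes a real difference, which is a simplification.

```python
def chance_of_false_positive(tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** tests

for tests in (1, 5, 10, 20):
    print(f"{tests:>2} tests: {chance_of_false_positive(tests):.0%}")
```

Under those assumptions, a run of 20 such tests has about a 64% chance of producing at least one false winner.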

This is why it’s so important to follow a system to ensure you are stopping tests when they have truly reached significance and the sample size is big enough to represent the real population. It’s also why we need to be aware of any outside influences, validate our findings with other data and even re-run important tests. And remember not to over-promise on your findings, whatever they may be.