How to measure the impact of an A/B Testing program

A/B Testing Program Overview

Below you will find an approach to attributing and measuring the impact of an A/B Testing program. If you find yourself asking any of the following questions throughout your A/B Testing journey, then you have come to the right place.

Problem Statements: 

  • What is the dollar value of an A/B Testing program?
  • How long should we attribute incrementality to a “go-forward” winning experience running at 100%?
  • How can I make sure I am running meaningful tests?
  • We are running all these tests, but I am not seeing the impact on my bottom line.

WHAT:

This document is designed to show the total incremental revenue gained or lost across all tests run. The approach uses revenue per session (RPS) as the primary metric for determining incremental revenue.

HOW:

EXAMPLE MATH:

This is a winning A/B test that ran at 50/50 for 35 days. 

STEP 1 – CALCULATE REVENUE PER SESSION

Control RPS = (Control Sales: $2,081,976 ÷ Sessions: 62,000) = $33.58

Test RPS = (Test Sales: $2,181,976 ÷ Sessions: 62,754) =  $34.77

STEP 2 – CALCULATE AVG SALES LIFT 

Avg Sales Lift = ( $34.77 ÷ $33.58 ) -1 = +3.54%

NOTE: It’s important that this metric reaches statistical significance (>95% confidence).
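The two steps above are easy to script. Below is a minimal Python sketch using the figures from this example; the variable names are illustrative.

  # Step 1 – revenue per session (RPS) for each arm
  control_sales, control_sessions = 2_081_976, 62_000
  test_sales, test_sessions = 2_181_976, 62_754

  control_rps = control_sales / control_sessions   # ≈ $33.58
  test_rps = test_sales / test_sessions            # ≈ $34.77

  # Step 2 – average sales lift of test over control
  avg_sales_lift = (test_rps / control_rps) - 1    # ≈ +3.54%
  print(f"Avg sales lift: {avg_sales_lift:+.2%}")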

STEP 3 – CALCULATE THE COST OF RUNNING TEST

Explained: If the control sessions had received the test variant, we would expect them to see the same sales lift the test variant saw. That foregone lift is the cost of running the test.

Cost of running test = ( Control Sales: $2,081,976 * Avg Sales Lift: 3.54% ) =  $73,783

STEP 4 – CALCULATE MULTIPLIER FOR TRAFFIC SPLIT

Explained: This matters most for multivariate tests. A standard A/B (50/50) test uses a multiplier of roughly 2x. The multiplier is needed to forecast what this experience will receive once it is running at 100% of traffic.

Total Sessions = 62,000 + 62,754 = 124,754

Control Traffic Distribution = 62,000 ÷ 124,754 = 49.7%

Test Traffic Distribution = 62,754 ÷ 124,754 = 50.3%

Multiplier = 100% ÷ 49.7% = 2.01

STEP 5 – CALCULATE VALUE OF THE CHANGE IF ROLLED OUT 100%

Value = (Cost of running test $73,783 * Multiplier 2.01) ≈ $148,463

STEP 6 – CALCULATE VALUE GAINED FROM TESTING PERIOD

Value = (Value at 100%: $148,463 – Cost of running test: $73,783) = $74,680
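Continuing the Python sketch, Steps 3 through 6 follow directly from the same inputs; this mirrors the arithmetic above rather than prescribing an implementation.

  # Inputs from Steps 1–2
  control_sales, control_sessions = 2_081_976, 62_000
  test_sales, test_sessions = 2_181_976, 62_754
  avg_sales_lift = (test_sales / test_sessions) / (control_sales / control_sessions) - 1

  # Step 3 – cost of running the test: the lift the control arm missed out on
  cost_of_test = control_sales * avg_sales_lift                        # ≈ $73,783

  # Step 4 – traffic-split multiplier to scale results to 100% of traffic
  multiplier = (control_sessions + test_sessions) / control_sessions   # ≈ 2.01

  # Step 5 – value of the change if rolled out to 100% of traffic
  value_at_100 = cost_of_test * multiplier                             # ≈ $148,463

  # Step 6 – value gained during the testing period itself
  value_during_test = value_at_100 - cost_of_test                      # ≈ $74,680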

STEP 7 – FORECAST INCREMENTAL REVENUE FOR EXPERIENCE AT 100%

Test Duration = 35 days

Avg Sales Per Day = ($148,463 ÷ 35) = $4,241

Projection = ($4,241 * 120 days) ≈ $509,018

Explained: We recommend a 120-day forecast due to the ever-changing ecosystem: customer behavior, seasonality, promotions, site redesigns, development updates, and further A/B testing make it difficult to expect consistent incremental lift for more than 120 days. You may also choose to keep running a winning test at 90/10 or 95/5 so you can confirm the experience is not negatively impacting KPIs over time.
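The forecast is a straight extrapolation, shown here as a continuation of the sketch (value_at_100 carried over from Step 5):

  # Step 7 – forecast incremental revenue at 100% of traffic
  value_at_100 = 148_463
  test_duration_days = 35

  avg_value_per_day = value_at_100 / test_duration_days   # ≈ $4,242 per day
  forecast_120_days = avg_value_per_day * 120             # ≈ $509K over 120 days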

STEP 8 – CALCULATE OVERVIEW PROGRAM VALUE

Program Value = (Winning tests + Losing tests – Costs)

Winning tests: the value from Steps 1–7.

Losing tests: the value from Steps 1–3, entered as a negative number.

Costs: We found that it costs roughly $8–10K per test once meetings, creative design, test plans, analytics, development, and QA are accounted for. For accuracy, we recommend that each program do its own cost calculations.
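Putting it together, here is a hypothetical roll-up in the same Python sketch; the test values and per-test cost below are placeholders, not real results.

  # Step 8 – overall program value across all tests
  # Each entry is the value from Steps 1–7 for a winner, Steps 1–3 for a loser.
  test_values = [509_018, -42_000, 131_500, -18_250]   # hypothetical per-test values
  cost_per_test = 9_000                                # assumed midpoint of ~$8–10K

  program_value = sum(test_values) - cost_per_test * len(test_values)
  print(f"Program value: ${program_value:,.0f}")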

SUMMARY: 

As we run more tests, we will want to show that the program is profitable. Using this process we can track our win/loss ratio, our incrementality, our goals, and more. This validates that we are running impactful tests with meaningful hypotheses.

Not all tests will have RPS as their primary metric; however, we feel it should still be reported in the same fashion. This attribution model is not meant for every A/B practice; however, every A/B practice should have some sort of program overview in place.

Useful Excel formulas

  • Statistical Significance

=NORMSDIST(ABS(test avg sales – control avg sales) / SQRT(test std sales^2 / test sessions + control std sales^2 / control sessions))

  • Confidence Interval

=CONFIDENCE.NORM(0.05, standard dev, sample size), where alpha = 0.05 corresponds to a 95% confidence level
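For those working outside Excel, here is a minimal Python equivalent of the significance formula above, using only the standard library; the per-session figures passed in are illustrative.

  from statistics import NormalDist

  def significance(test_avg, test_std, test_n, control_avg, control_std, control_n):
      # Two-sample z-test, mirroring NORMSDIST(ABS(...) / SQRT(...))
      z = abs(test_avg - control_avg) / (
          test_std**2 / test_n + control_std**2 / control_n
      ) ** 0.5
      return NormalDist().cdf(z)

  # Illustrative per-session averages and standard deviations
  print(significance(34.77, 52.0, 62_754, 33.58, 51.0, 62_000))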

Power Analysis & Sample Size Estimation

Performing power analysis and sample size estimation is an important aspect of experimental design. Without these calculations, the sample size may be too high or too low. If the sample size is too low, the experiment will lack the precision to provide reliable answers to the questions it is investigating. In this case, it would be wise to alter or abandon the experiment. If the sample size is too large, time and resources will be wasted, often for minimal gain.

HOW

For each test, we will gather the following four data points before any code is written or resources are committed.

  1. What is the primary key performance indicator (KPI)?
    • By how much do we believe our hypothesis will affect that KPI (effect size)?
      • Our standard can be 1%, 2%, or 3%.
  2. What is the acceptable minimum confidence level – 90%, 95%, 99% (significance level)?
  3. What is the acceptable minimum power level – 70%, 80%, 90%?
    • Often considered to be between 0.80 and 0.90.
    • Think of “Power” as the strength of the experiment. Statistical power is the probability that the test will detect an effect that actually exists.
  4. What is the current traffic size on the page being tested? 

WHY

With these four quantities (effect size, sample size, significance level, power), we can enter any three and the fourth is calculated. The basic idea of calculating power or sample size is to leave out the argument you want to calculate: if you want power, leave the power argument out of the equation; if you want sample size, leave ‘n’ out. Whatever parameter you want to calculate is determined from the others.
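As an illustration, the statsmodels library follows this same leave-one-out convention: whichever argument you leave as None is the one that gets solved. A small sketch, assuming a standardized effect size:

  from statsmodels.stats.power import zt_ind_solve_power

  # Solve for sample size: leave nobs1 out
  n = zt_ind_solve_power(effect_size=0.02, alpha=0.05, power=0.80)
  print(f"Sessions needed per variant: {n:,.0f}")

  # Solve for power instead: supply nobs1 and leave power out
  p = zt_ind_solve_power(effect_size=0.02, nobs1=25_000, alpha=0.05)
  print(f"Power at 25k sessions per variant: {p:.0%}")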

WHAT 

EXAMPLE: Power Analysis – Checkout – Guest Checkout

Hypothesis/Success Criteria: If we clearly call out the guest checkout option then we will increase conversion by at least 2%.

What is the optimal sample size for the given hypothesis? 

  • Sample Size (n)  = Unknown?
  • Effect Size (d) = 2%
  • Power = 80%
  • Sig Level (alpha/confidence level)  = 0.05 or 95%

Sample Size = 19,625

This tells us that we need ~20k sessions to have an 80% probability of detecting a 2% lift in conversion at 95% confidence, if that lift actually exists. If we do not see a 2% increase in conversion at 95% confidence within the optimal sample size, then we fail to reject the null hypothesis.

If we see the 2% increase in conversion rate at 95% confidence within ~20k sessions, then we reject the null hypothesis.
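The sample size above can be reproduced with the same statsmodels call, treating the 2% effect as a standardized effect size (a simplifying assumption; a proportions-based calculation anchored to the baseline conversion rate would give a different n):

  from statsmodels.stats.power import zt_ind_solve_power

  n = zt_ind_solve_power(effect_size=0.02, alpha=0.05, power=0.80,
                         alternative='two-sided')
  print(f"{n:,.0f}")   # ≈ 19,622, in line with the ~19,625 above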

Power Analysis – Checkout – Guest Checkout – Current Results

What is the Power of our current test results?

  • Sample Size (n)  = 30,946
  • Effect Size (d) = 1.8%
  • Power = Unknown?
  • Sig Level (alpha/confidence level)  = 0.12 or 88%

Power = ~90%

This tells us there is a 90% probability that our test will detect a change of this size if it actually exists.

However, we only have 88% confidence in the change we observed.

What do we do? 

  • We could accept 88% as “good enough”. 
  • We could re-run our power analysis with a smaller effect size; this increases the sample size needed, so we would continue running the test.