Book Summary: Trustworthy Online Controlled Experiments
This is a summary of the book Trustworthy Online Controlled Experiments, which consists of five parts touching on different aspects of A/B testing for different audiences.
Part I: Introductory Topics for Everyone
Part II: Selected Topics for Everyone
Part III: Complementary and Alternative Techniques to Controlled Experiments
Part IV: Advanced Topics for Building an Experimentation Platform
Part V: Advanced Topics for Analyzing Experiments
This is a wonderful text full of hard-earned lessons from decades of experience and real case studies. I would recommend that anyone involved or interested in A/B testing own a copy!
What this book is about:
Conduct Trustworthy Online Controlled Experiments whenever you can for your online product.
However, problems do occur in the design, instrumentation and implementation of experiments, and these undermine their validity.
This book contains hard-earned lessons from decades of experience on how to ensure your experiments are trustworthy while scaling your operation.
Key Takeaways:
- Good decision making relies on quality information. When Randomised Controlled Experiments (RCEs) can be conducted reliably, they offer the highest quality of information.
- There are countless ways that an experiment can go wrong. To ensure the validity of an experiment, we need to rigorously examine the design, instrumentation and implementation of the experiment.
- The faster and the more experiments you can run, the faster you can move. Developing standardised and automated practices can help you scale your operation.
- Your metrics act as your compass: if they do not point towards your destination (the organisational goal), you will not reach your north star. Choose short-term metrics that lead to changes in your long-term metrics, and ensure metrics from different teams are aligned.
- Ultimately, you are trying to learn a causal model of how your actions can steer the product as a whole. Institutionalising historical tests as organisational memory helps you predict the direction and size of an action's effect, which leads to better risk mitigation and better selection of ideas to test.
The following maps the chapters to specific areas of A/B testing.
- Theory and Experimental design
The whole of Part V is devoted to this topic; Chapter 3 covers factors that can undermine an experiment.
Chapters 10 and 11 cover complementary and alternative techniques to controlled experiments.
- How to design metrics
Chapters 6, 7, and 21.
- Platform and software implementation
Chapter 4, with additional details in Part IV.
- Building an Experimentation Culture
Chapter 1 provides the background, Chapter 8 covers institutional memory, and Chapter 9 covers ethics.
Chapter Summary:
Chapter 1: Introduction and Motivation
Why should we do online controlled experiments?
The better the quality of the information we base our decision on, the more likely we are to reach the desired outcome.
I have made a slight change to the pyramid, but the idea is the same. When we have evidence available from the higher levels of the pyramid, we should adopt it ahead of lower-ranked information.
The slight difference in the pyramid above is the separation of observational studies into exploratory analysis and causal modelling. Causal modelling is considered a higher quality of information, as it controls for the effect of multiple forces and provides a measure of the type and strength of the relationship.
My understanding is that the higher the quality of information, the more it should:
- Reduce the influence of noise, bias and confounding effects.
- Improve generalisation and accuracy.
A Randomised Controlled Experiment is only superseded by multiple RCEs, so conduct them whenever you can reliably execute them.
Requirements for Online Controlled Experiments to be successful:
- The organisation is data-driven and has an Overall Evaluation Criterion (OEC).
- The organisation is willing to invest and develop the necessary infrastructure.
- The organisation recognises that it is bad at assessing the value of ideas and that RCEs are the best way to make a decision.
Although experiments are useful for the following, they cannot help you come up with the right strategy:
- Identify the ROI of each action.
- Detect small changes such as font size that can have a significant impact.
- Optimise backend algorithms.
- Compare various hypotheses and pivot your business.
Chapter 2: Running and Analyzing Experiments: An End-to-End Example.
- Designing the experiment: What is the randomisation unit? How long should the experiment run? Which test and metrics should we use? (A sizing sketch follows this list.)
- Running the experiment and getting the data: Set up logs and infrastructure to collect data.
- Interpreting the results: Is the result valid and does it make sense?
- From results to decisions: Should we make the change?
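One of the design questions above is how long the experiment should run, which mostly comes down to sample size. Below is a minimal sketch, not from the book, of sizing a two-variant test on conversion rate; the baseline rate, minimum detectable lift, alpha and power values are illustrative assumptions.

```python
# A minimal sketch: how many users per variant are needed to detect a lift in
# conversion rate from 5.0% to 5.5%? All numbers here are illustrative assumptions.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.050                                         # assumed control conversion rate
target = 0.055                                           # smallest lift we care to detect
effect_size = proportion_effectsize(target, baseline)    # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # Type I error rate
    power=0.80,            # 1 - Type II error rate
    ratio=1.0,             # equal traffic to control and treatment
    alternative="two-sided",
)
print(f"Users needed per variant: {n_per_variant:,.0f}")
```

Dividing this number by the expected daily traffic gives a rough lower bound on the experiment's duration.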
Chapter 3: Twyman’s Law and Experimentation Trustworthiness.
Factors undermining the validity of an experiment.
There are many factors that can undermine the validity of an experiment, and when a result looks too good to be true, the chance that something has gone wrong is high. Threats can be classified into two categories: internal and external.
Internal threats are associated with the design and execution of the experiment itself. Common offenders are violations of statistical assumptions such as the Stable Unit Treatment Value Assumption (SUTVA) or a Sample Ratio Mismatch (SRM).
External threats concern generalising the result beyond the study, for example to another segment or time period. Examples would be generalising the result from customers in the USA to Egypt, or a sales lift from Christmas to a non-holiday period.
Chapter 4: Experimentation Platform and Culture.
Do you need an experimentation platform? If so, how to build it.
Four maturity levels of A/B testing:
- Crawl: Starting to conduct A/B testing (10 tests/year)
- Walk: Improve speed, practices and implement safeguards (50 tests/year)
- Run: Scale the operation (250 tests/year)
- Fly: Institute organisational memory (1,000+ tests/year)
Build vs Buy
Whether to build or buy is a tough decision for any software adoption; there are pros and cons to both approaches, and the justification depends on the company culture and the ROI. Building your own platform is most beneficial when you have sufficient learning, an established process and the intention to scale.
An experimentation platform generally has four high-level components. These can be built independently, but for interoperability they should be designed with each other in mind.
- UI/API for end-users to define, manage and configure the experiments.
- System to deploy experiments and assign users to variants (see the assignment sketch after this list).
- Logging facility to capture the user action and system performance.
- Analytical engine to compute the test and summarise the results.
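The assignment component is usually implemented as deterministic hashing so that a user always lands in the same bucket. Here is a minimal sketch of that idea; the experiment name and traffic split are hypothetical and not from the book.

```python
# A minimal sketch of deterministic variant assignment: hashing the user ID
# together with the experiment name gives a stable, roughly uniform bucket.
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Return 'treatment' or 'control' deterministically for a user/experiment pair."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000   # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user-42", "checkout-button-colour"))  # hypothetical experiment name
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user's bucket in one test does not determine their bucket in the next.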
Chapter 5: Speed Matters
Don’t let anything slow you down.
Changes in latency can move almost all metrics.
Chapter 6: Organizational Metrics
How to design good metrics
Taxonomy:
- Goal/Success/True North Metrics
- Driver Metrics
- Guardrail Metrics
Goal Metrics
- Should align with the long term strategy of the business and should be simple and stable.
- May be difficult to move in the short run and thus not ideal for testing.
- An example would be revenue per customer or lifetime value.
Driver Metrics
- These should align with the goal and be actionable, sensitive and resistant to gaming.
- The purpose of these metrics is to direct actions toward moving the long-term goal metrics.
- Frameworks that help with developing a suite of driver metrics are HEART, PIRATE and user funnels.
Guardrail Metrics
- The purpose of the guardrail is to either protect business interests or to ensure the validity of the experiment.
- Latency is a common guardrail metric, since increases in latency significantly impact other metrics and reduce revenue.
- Another type of guardrail metric ensures that the traffic directed to the variants is as expected (Sample Ratio Mismatch).
- Pageviews per user is also a common guardrail metric, since a change in it indicates a shift in almost all other metrics, as pageviews and user counts are common denominators in other metrics.
- Metrics that are not expected to change for the given implementation also make useful guardrails.
Chapter 7: Metrics for Experimentation and the Overall Evaluation Criterion
One metric to cover multiple directions.
Although many texts promote optimising a single metric, a single metric is often insufficient to steer the business in the right direction. The Overall Evaluation Criterion (OEC) is an approach that combines multiple metrics, with their trade-offs in mind, to capture the overall business goal.
Take a subscription service, for example: instead of maximising only the subscription rate, we should also consider the churn rate.
OEC = 0.6 * subscription rate + 0.4 * (-first day churn rate)
When a metric is not a rate, normalise it to the range 0–1. This metric protects us from acquiring single-use customers and is better aligned with the business goal of increasing the number of long-term subscribers.
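As a minimal sketch of the OEC above, the weighted combination is trivial to compute once the component metrics are on a 0–1 scale; the sample values below are made up purely to show the trade-off.

```python
# A minimal sketch of the example OEC: a weighted combination of subscription
# rate and (negated) first-day churn rate. Both are rates, so already in [0, 1].
def oec(subscription_rate: float, first_day_churn_rate: float) -> float:
    """Overall Evaluation Criterion combining acquisition and early retention."""
    return 0.6 * subscription_rate + 0.4 * (-first_day_churn_rate)

print(oec(subscription_rate=0.12, first_day_churn_rate=0.03))  # e.g. control
print(oec(subscription_rate=0.14, first_day_churn_rate=0.07))  # treatment that acquires churn-prone users
```

A treatment that lifts the subscription rate but attracts customers who churn immediately can score worse on the OEC than control, which is exactly the point of combining the metrics.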
Chapter 8: Institutional Memory and Meta-Analysis
Convert tests into knowledge accessible to everyone.
One key outcome of online experiments is to learn a causal model about the impact of the changes we make and how we can repeat them.
The more tests you conduct, the more confidence you have in the relationship. This is the essence of meta-analysis: a systematic study of multiple tests.
Chapter 9: Ethics in Controlled Experiments
As a data scientist, I have always been inspired by the Airbnb data science motto
Data isn’t numbers, it’s people.
Yet too often as data scientists, we neglect that it is humans who generate the data. To be ethical is to have empathy and to avoid conducting experiments that you would not want others to conduct on you.
You would not want a social media company manipulating your emotions by intentionally showing you more positive/negative emotions.
You would not be happy if someone paying the same price got a better product or experience.
Your product can affect the livelihood of others. Finding employment can be stressful; imagine a job-market platform excluding you from your dream job as part of an experiment.
The three guiding principles of our conduct are Respect, Beneficence and Justice.
Chapter 10: Complementary Techniques
How to come up with ideas.
Coming up with ideas may be difficult once the product matures. The following complementary techniques provide a new flow of ideas into the idea funnel:
- Retrospective analysis
- Human evaluation
- User experience research
- Focus groups
- Surveys
Chapter 11 Observational Causal Studies
What to do when you can't run an RCE.
The conditions for a randomised controlled experiment may not always be available, so having alternative tools for drawing causal inference is useful. These techniques are known as observational causal studies or quasi-experiments.
Interrupted time series
The idea is to alternate between control and treatment over time so that the treatment effect can be estimated.
Propensity score matching
Match individuals based on their common confounders, then proceed to analyse the results as if the data came from an experiment.
Difference-in-difference
Create a counterfactual based on a different geographic region/segment.
You can also build models to predict the counterfactual, that is, what would have happened if the treatment had not been applied, and then calculate the difference.
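A minimal sketch of the difference-in-differences idea follows; the regions and numbers are illustrative only. The control region's change over time stands in for the counterfactual trend of the treated region.

```python
# A minimal sketch of a difference-in-differences estimate with made-up numbers.
pre = {"treated": 100.0, "control": 90.0}    # mean bookings per day before the change
post = {"treated": 130.0, "control": 105.0}  # mean bookings per day after the change

treated_change = post["treated"] - pre["treated"]   # 30: change in the treated region
control_change = post["control"] - pre["control"]   # 15: change we would expect anyway
did_estimate = treated_change - control_change      # 15: estimated treatment effect
print(f"Difference-in-differences estimate: {did_estimate}")
```

The estimate is only as good as the "parallel trends" assumption: absent the treatment, both regions would have moved by the same amount.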
Chapter 12: Client-Side Experiments
Considerations and implications for mobile/client-side implementations.
This chapter focuses specifically on considerations for thick clients, such as mobile apps, where a significant amount of code is shipped with the client, as opposed to a thin client like a web browser.
The two primary considerations when dealing with a thick client are 1) the release cycle and 2) data communication between client and server.
Excluding the development of the app, there are three steps in the mobile app lifecycle.
- Submit the app to the store for review.
- Review completed and the app is published.
- The user downloads or updates the app.
This delay between the date of submission and the time the user updates the app results in the following implications:
- Since changes cannot be made easily or quickly, they need to be anticipated carefully, and the risk of a faulty variant can be considerable. When possible, have configurations set on the server side while shipping defaults with the app (see the configuration sketch at the end of this chapter).
- Experiments may start at different times for treatment and control, as users update the app at different times.
- Adoption bias: there may be differences between people who frequently update their app and those who don't.
Communication of information is another important aspect that we need to consider for thick clients:
- The client may not always have a connection to the internet, so default settings/parameters should be available at launch time. In addition, this means there will be a delay between the time of the experience and the time it is logged.
- Collecting and sending data on the client device increases data, battery and hardware usage, and slows down performance. This can result in a poor experience and uninstalls.
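The "defaults shipped with the client, configuration fetched from the server" pattern mentioned above can be sketched as follows. The endpoint and parameter names are hypothetical; the point is that the app still behaves sensibly when the fetch fails or the device is offline.

```python
# A minimal sketch: try to fetch experiment configuration from the server, and
# fall back to defaults bundled with the app when the network is unavailable.
import json
from urllib.request import urlopen
from urllib.error import URLError

DEFAULT_CONFIG = {"checkout_flow": "classic", "experiment_id": None}  # bundled defaults (hypothetical keys)

def load_config(url: str = "https://example.com/experiment-config") -> dict:
    """Prefer the server configuration; fall back to bundled defaults on any failure."""
    try:
        with urlopen(url, timeout=2) as response:
            return {**DEFAULT_CONFIG, **json.load(response)}
    except (URLError, TimeoutError, json.JSONDecodeError):
        return DEFAULT_CONFIG

config = load_config()
print(config)
```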
Chapter 13: Instrumentation
Understanding the environment the user and the system are operating in.
No pilot would fly blindly without critical instruments like an altimeter or compass. Neither should we.
Instrumentation is key to detecting internal threats to validity and should be part of the initial specification of a feature. We should collect data on both the user and the system to ensure that the experience and the system are operating as expected.
Examples of instrumentation include the number of crashes a user experiences or the latency of the system.
Chapter 14: Choosing a Randomization Unit
How should the allocation of variants be determined, and what are the various proxies for user-level tracking?
There are two primary considerations when choosing how the variants are assigned.
The first and foremost consideration is the user experience. Take an extreme example where you want to test different font colours for your site: the experience would be terrible and appear unprofessional if the font colour differed on every page for the same user. The result of the experiment can be undermined if users can detect the variants and behave differently.
The second point to consider is the metrics you are calculating and the unit of analysis. User-level metrics are meaningless if the unit of randomisation is at the query or page level: if you are serving two different search algorithms to the same user, the session/user conversion rate is meaningless and the measured treatment effect should be zero.
In addition, you will be violating the Stable Unit Treatment Value Assumption (SUTVA), which can undermine the validity of your experiment.
Chapter 15: Ramping Experiment Exposure: Trading Off Speed, Quality, and Risk
The trade-off between speed, quality and risk when ramping up experiments, and the four phases of ramping.
Speed <-> Quality:
Certain behaviours require time to learn; by ramping up the experiment too fast, we may miss the learning opportunity and make a suboptimal decision.
Speed <-> Risk:
The faster you ramp up the experiment, the greater the risk of unexpected adverse effects. For example, your new API may not scale fast enough to handle 100% of the new traffic, or perhaps a campaign has an extremely negative impact on revenue.
Four ramp phases:
The maximum power ramp (MPR) is the allocation that gives the maximum statistical power, assigning 50% of the traffic to control and 50% to treatment.
1: Pre-MPR: The purpose of this phase is to mitigate risk by testing the new feature on a small “ring” of users, such as whitelisted or beta users.
2: MPR: This is the phase where we measure the impact of the experiment.
3: Post-MPR: A transition phase to address any scaling issues.
4: Long-term holdout: A phase dedicated to learning long-term behaviour. Not a necessary step, but useful for features where short- and long-term effects could differ, or when you want to learn the impact on a longer-term metric such as one-month retention.
Finally, there should be a clean-up phase to avoid previous implementations being triggered in the future.
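The reason the 50/50 split is called the maximum power ramp can be seen by fixing the total sample size and varying the share of traffic sent to treatment. The sketch below does this with an assumed baseline rate, lift and total traffic; all numbers are illustrative, not from the book.

```python
# A minimal sketch: for a fixed total sample size, statistical power peaks when
# traffic is split evenly between control and treatment (the MPR).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect_size = proportion_effectsize(0.055, 0.050)   # detect a 5.0% -> 5.5% lift (assumed)
total_n = 60_000                                    # assumed total traffic available

for treatment_share in (0.05, 0.10, 0.25, 0.50):
    n_treatment = total_n * treatment_share
    ratio = (1 - treatment_share) / treatment_share          # control size relative to treatment
    power = NormalIndPower().power(
        effect_size=effect_size, nobs1=n_treatment, alpha=0.05, ratio=ratio
    )
    print(f"treatment share {treatment_share:.0%}: power = {power:.2f}")
```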
Chapter 16: Scaling Experiment Analysis
How to move from the “walk” phase to “run” and “fly” phase of experimentation, through standardisation of process and tools.
The chapter points out three components required for an experimentation platform.
- Data Processing — sort, group, cleaning of data.
- Data Computation — Compute metrics and summary statistics.
- Results summary and Visualisation — Scorecards, dashboards for decision making.
The key is to automate and to create standardised processes that reduce ad hoc decision making.
Chapter 17: The Statistics behind Online Controlled Experiments
Statistical tests, p-values, power, etc.
This is an introductory section on the fundamentals of statistical hypothesis testing.
The author also offers a very simple solution to the multiple hypothesis testing problem. When you have multiple metrics, group them into the following categories with adjusted significance thresholds:
- First-order metrics (0.05): Those that you expect to be impacted.
- Second-order metrics (0.01): Those potentially impacted (e.g. cannibalisation).
- Third-order metrics (0.001): Unlikely to be impacted.
If you have multiple tests of the same hypothesis that individually do not have enough power, you can use Fisher's meta-analysis to combine the results of the tests and improve power.
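Below is a minimal sketch of Fisher's method for combining p-values from independent tests of the same hypothesis; the p-values are made up for illustration. Individually none clears 0.05, but the combined evidence does.

```python
# A minimal sketch of Fisher's method for combining independent p-values.
from scipy.stats import combine_pvalues

p_values = [0.08, 0.11, 0.06]   # same hypothesis, three independent experiments (made up)
statistic, combined_p = combine_pvalues(p_values, method="fisher")
print(f"Combined p-value: {combined_p:.4f}")
```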
Chapter 18: Variance Estimation and Improved Sensitivity: Pitfalls and Solutions.
Counter to intuition, outliers often reduce the sensitivity of a test. Although outliers can inflate the estimated treatment effect, the increase in the variance is even greater, rendering the test insignificant.
You should always investigate the source of outliers, but removing them or capping the metric can improve sensitivity.
Improving the sensitivity of an experiment means that we can conclude the test earlier or detect smaller changes. Reducing the variance is the most common approach:
- Choose a similar metric that has a smaller variance. For example, the percentage of people who purchased is more sensitive than the total purchase amount. You can then keep the average purchase amount as a guardrail metric.
- Another technique is to transform the metric. Capping, binarisation or log transformation are all effective in reducing the variance.
- Another family of techniques focuses on reducing the variance by reducing or eliminating between-group variance, in other words the “explained variance that existed prior to or independently of the experiment”. Stratification, control variates and CUPED all belong to this family. In stratification, instead of calculating the variance from the complete sample, you pool the variances from each stratum (e.g. mobile, desktop, tablet). This post from booking.com provides an excellent illustration of CUPED, and a small sketch follows this list.
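Here is a minimal sketch of the CUPED adjustment using a pre-experiment covariate (for example, each user's spend in the weeks before the experiment). The data is simulated purely to show the variance reduction.

```python
# A minimal sketch of CUPED: subtract the part of the metric explained by a
# pre-experiment covariate, which shrinks variance without changing the mean.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
pre_spend = rng.gamma(shape=2.0, scale=10.0, size=n)          # pre-experiment covariate (simulated)
spend = 5.0 + 0.8 * pre_spend + rng.normal(0, 5, size=n)      # in-experiment metric (simulated)

theta = np.cov(spend, pre_spend)[0, 1] / np.var(pre_spend, ddof=1)  # regression coefficient
spend_cuped = spend - theta * (pre_spend - pre_spend.mean())        # adjusted metric, same mean

print(f"variance before: {spend.var():.1f}, after CUPED: {spend_cuped.var():.1f}")
```

In an experiment, the same theta is applied to both variants, so the treatment effect estimate is unchanged while its standard error shrinks.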
Chapter 19: The A/A Test.
The gold standard to ensure your result is trustworthy.
An A/A test is simply an A/B test where A = B. It can detect both violations of statistical assumptions and problems in the implementation.
- Is your Type I error rate (alpha) really 5%? If not, it can point to a violation of multiple assumptions and undermine the validity of the whole experiment (see the simulation sketch below).
- Is the traffic really the percentage you have assigned?
- Is there bias between control and treatment? This could hint at a faulty assignment implementation or residual effects from previous experiments.
In addition, you can use the A/A test to estimate the variance and assess the variability of the metric.
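A minimal sketch of the Type I error check follows: run many simulated A/A comparisons and confirm that roughly 5% of them come out "significant". The metric distribution here is simulated for illustration; in practice you would replay real traffic through your own pipeline.

```python
# A minimal sketch: with no true difference, about 5% of t-tests should have p < 0.05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
false_positives = 0
n_tests = 1_000

for _ in range(n_tests):
    a = rng.exponential(scale=20.0, size=5_000)   # both groups drawn from the same population
    b = rng.exponential(scale=20.0, size=5_000)
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / n_tests:.3f}  (expect ~0.05)")
```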
Chapter 20: Triggering for Improved Sensitivity.
Test only on the sample of interest: the less you pollute the sample, the cleaner your answer will be. Traffic that could not have been affected by the change only adds noise; the more of that noise you exclude from the experiment, the more reliable and more powerful it becomes.
An example of triggering is to only include traffic reaching the checkout page if you are testing for a change on the checkout page.
A more subtle example: say you decide to extend your coupon offer from those who spent more than $35 to those who spent more than $25. You should then only analyse changes for users who spent between $25 and $35, as the behaviour of those spending more than $35 should not change.
When you conduct a triggered analysis, you have to be very careful about generalising the result to the rest of the population.
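A minimal sketch of the coupon example above: only users whose basket falls in the $25-$35 band could be affected by extending the offer, so only they enter the analysis. The data and column names are hypothetical.

```python
# A minimal sketch of a triggered analysis: filter to the users the change could affect.
import pandas as pd

orders = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "variant": ["control", "treatment", "control", "treatment", "control"],
    "basket_value": [18.0, 27.5, 33.0, 41.0, 29.0],
})

triggered = orders[orders["basket_value"].between(25, 35)]   # counterfactually affected users only
print(triggered.groupby("variant")["basket_value"].mean())
```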
Chapter 21: Sample Ratio Mismatch and Other Trust-Related Guardrail Metrics.
Sample Ratio Mismatch (SRM) is the number one guardrail metric you should look at; if it is violated, stop and investigate!
SRM happens when the ratio of observed traffic differs from the intended allocation of traffic to each variant. A chi-squared test is usually sufficient to test whether this assumption is violated.
Debugging an SRM is difficult, but it generally points to fundamental problems that render the experiment invalid.
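The check itself is a one-liner chi-squared goodness-of-fit test; the observed counts below are illustrative, against an intended 50/50 split.

```python
# A minimal sketch of an SRM check: compare observed traffic to the intended allocation.
from scipy.stats import chisquare

observed = [50_126, 49_025]                      # users seen in control / treatment (made up)
total = sum(observed)
expected = [total * 0.5, total * 0.5]            # intended 50/50 allocation

statistic, p_value = chisquare(observed, f_exp=expected)
print(f"SRM p-value: {p_value:.4g}")             # a tiny p-value flags a mismatch
```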
Chapter 22: Leakage and Interference between Variants.
The SUTVA assumption requires units to be independent; however, this is not always the case. The connections between units can be:
Direct:
- In a social network setting, people are connected; the impact on a user in treatment may propagate to a user in control and thus result in an underestimation of the treatment effect.
- Another scenario is when a feature is shared between people or a team. Skype/Zoom calls or new features in a team setting, such as Trello boards, can affect people in both control and treatment.
Indirect:
- This happens commonly in a two-sided market where the change can impact the whole market. If, say, the treatment group receives a coupon on an accommodation platform, they may be more likely to make a booking, but at the same time they crowd out the control group and reduce its booking rate. This can lead to an overestimate of the treatment effect!
- Shared resources, such as hardware or a marketing budget, are other common sources of leakage.
In general, the solution is to create isolation between the groups.
Chapter 23: Measuring Long-Term Treatment Effects.
Short-term and long-term effects may not always coincide; selling a lot in a short amount of time can lead to a long-term loss of revenue.
There are many reasons why the short-term effect diverges from the long-term effect.
First of all, user behaviour may not stabilise within the first two weeks. Primacy and novelty effects are almost always present, and users also learn and adapt to the product over time.
Another cause of divergence is associated with impacts on the user base and survivorship bias. Showing a lot of ads may increase ad revenue in the short run, but the poor experience drives customers away and ultimately leads to a fall in revenue. In addition, since tests are usually run over a short period, say two weeks, the results are heavily weighted toward users with frequent visits. The overall effect can then change as the effect reaches less frequent visitors after the experiment has ended.
The simplest and most popular approach is to just run a long experiment beyond the pre-determined sample size from the power analysis. This, however, is inefficient and reduces the speed of testing. Another approach is to curate a stable cohort and only conduct the long-running experiment on this cohort.