“If it can be a test, test it. If we can’t test it, we probably don’t do it.” — Stuart Frisby, ex Director of Design at Booking.com
The Booking.com website boasts an exceptional conversion rate, at 2-3 times the industry average.
They’re seeing continued growth (26.02% revenue YoY), over 557 million web visits in one month, and are the most downloaded travel app globally.
Much of this can be attributed to their successful culture of experimentation.
Booking.com is far from the only company running large numbers of A/B tests to drive business results. Microsoft, Google, and Amazon all operate testing cultures at scale, but what distinguishes Booking.com is the rigour with which experimentation methodology has been embedded throughout the organisation, the scale at which they test, and the culture they have built in support of testing.
“Booking.com, Expedia, and their ilk are the exception. Instead of running hundreds or thousands of online tests a year, many firms run no more than a few dozen that have little impact. If testing is so valuable, why don’t companies do it more? After examining this question for several years, I can tell you that the central reason is culture.” – Stefan Thomke, “Building a Culture of Experimentation”, Harvard Business Review
I worked in Product at Booking.com for over 4 years, learning everything there is to know about being exceptional at experimentation. In this case study, I’ll take you through what any organisation can learn from Booking.com’s testing culture, and share the universal principles that can be applied to any company, at any stage.
There are four key reasons that Booking.com leans so heavily on experimentation:
- Builds an evidence based culture
- Spots pockets of value sooner
- Harnesses the power of its teams
- Prevents costly mistakes
1. Builds an evidence based culture
Some of the biggest wins that I saw during my time at Booking.com were:
- Changes to copy
- Changes to positioning of a message
- Removing a form field
Some of the craziest losses were also:
- Implementing a change that customers had told us they wanted
- Reducing a step in the customer journey
- Providing customers with more information to make informed decisions
A lot of things that we thought were the right things to do for our customer, just ended up confusing them and preventing them from progressing.
It does not matter how big or small the change might be. It doesn’t matter whether you have done discovery and research. Until you test a change with real users on your website, you won’t know whether you have created the right experience.
2. Spots pockets of value sooner
Once you accept that everything is an assumption, you realise that experimentation is an incredible way to focus your valuable time in the right areas.
Many organisations have implemented Agile, allowing them to deliver to market faster than ever before. However, this is all a waste of time if the delivered features don’t add value.
3. Harnesses the power of your teams
Prior to working at Booking.com, I came from a role where I was empowered, but if a HiPPO’s idea came up – that would probably win.
At Booking.com, if a HiPPO had an idea, we would ensure that this was framed as a hypothesis, was in line with our strategic goals – and only then, we would test this.
If we were already working on bets that we deemed to be more likely to succeed, our bets would take priority. If we tested the HiPPO’s idea and it failed, then it wouldn’t go live.
No one could argue with this approach. Evidence is king at Booking.
4. Prevents costly mistakes
Experimentation acts as an extra safety net. If all shipped code is wrapped in an experiment, then it is easy to turn the feature on and off again if something is broken.
You can see immediately in the data if something has gone wrong on the website.
I can count at least 10 times where we launched an experiment and needed to turn it off the same day as we had broken something else. We could see immediately in the data that bookings were crashing on our test variant. Sometimes it doesn’t matter how sophisticated your QA process is, something that you never considered can break.
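This kill-switch pattern can be sketched as a simple experiment gate. Everything below is illustrative – the names and config shape are my own, not Booking.com’s real tooling:

```python
import hashlib

# Minimal sketch of wrapping a feature in an experiment flag with a kill
# switch. Names and structure are hypothetical, not Booking.com's actual API.
EXPERIMENTS = {"custom_price_filter": {"enabled": True, "traffic_share": 0.5}}

def in_variant(experiment: str, user_id: int) -> bool:
    config = EXPERIMENTS.get(experiment)
    if config is None or not config["enabled"]:  # kill switch: set enabled=False
        return False
    # Deterministic bucketing: the same user always lands in the same bucket.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < config["traffic_share"] * 100

def render_filter(user_id: int) -> str:
    if in_variant("custom_price_filter", user_id):
        return "custom price slider"   # new variant under test
    return "fixed price checkboxes"    # existing control experience
```

Because bucketing is deterministic, flipping `enabled` to `False` instantly reverts every user to the control experience without a redeploy.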
Here are the steps to build an experimentation culture like Booking.com.
Choose the right bets
Before you even get close to starting an experiment, you need to identify the right problem space to work on, where you will really have an impact.
At Booking.com, this was driven by our strategy. What are we trying to achieve overall as a business? Where do we really want to play to win?
Each group of teams knew their key goals and where they were contributing to the strategy. Each team had their own OKRs and areas of focus. These are crucial in establishing the key metrics that you want to drive through your experiments.
The first question you should answer before identifying any experiments to run is: ‘Why is this area important to focus on now?’. The answer should be that it’s an outcome that will drive strategic goals.
If this is not true, you should park the problem area and can assess it again if it becomes relevant.
Use an outcome driven approach
As opposed to defining outputs, you should focus on defining:
- The outcome you want to achieve
- The key results you want to drive
- The knowledge and evidence you have in this space to identify opportunities or problems to solve
- Hypotheses to solve these
- Different tactics to test your hypothesis
Prioritise your bets
Depending on your objective and size of the problems / opportunities that you’re solving, you might have too many ideas to test.
The key ways that I prioritise my bets are by:
- Testing the riskiest assumptions first
- Testing the areas that we believe will add the most value
If you’re working on a big problem space, that is relatively unknown – then rank your bets on a grid like this.
If you’re working on a more simple problem space, you can use a simple value (based on your key outcomes) vs. effort ranking. This allows you to test your ‘quick wins’.
The creation of the experiment is crucial to getting accurate results. I’ve seen people choose the wrong metrics, set the wrong run-time, test changes too small to have a measurable impact – and much more.
This is how to do it:
- Create a clear hypothesis
- Choose the right success metrics
- Understand your experiment controls and run-time
- Analyse and iterate on your experiment
1. Create a clear hypothesis
The structure of a great hypothesis looks like this:
Based on [evidence].
We believe [X] will encourage [these users] to [change behaviour].
We will know this when we see [effect] happen to [metric].
This incorporates all of the important factors that you need:
- Why you believe in your idea (this helps you to create a feature that should work, based on insight)
- What your feature is
- Who it is for
- The behaviour you want to change
- The primary metric that you will use to measure this
It should be:
- Based on evidence
- Outcome based (focussed on what you want to achieve)
- Aligned with your goals
- Specific
- Measurable
You should always write one; this step cannot be skipped. This is the bedrock of your experiment.
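To make the template concrete, the hypothesis fields can be captured as structured data so no part gets skipped. This structure is my own illustration, not a Booking.com artifact:

```python
from dataclasses import dataclass

# Hypothetical structure mirroring the hypothesis template above.
@dataclass
class Hypothesis:
    evidence: str   # why you believe in the idea
    change: str     # what the feature is
    audience: str   # who it is for
    behaviour: str  # the behaviour you want to change
    effect: str     # the effect you expect to see
    metric: str     # the primary metric you will measure

    def statement(self) -> str:
        return (f"Based on {self.evidence}. "
                f"We believe {self.change} will encourage {self.audience} "
                f"to {self.behaviour}. We will know this when we see "
                f"{self.effect} happen to {self.metric}.")
```

Forcing every field to be filled in is a cheap way to enforce the ‘this step cannot be skipped’ rule.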
2. Choose the right success metrics
This is one of the most important parts of designing an experiment. Usually, an experiment will influence multiple metrics. You will only truly be able to say whether your experiment succeeded if you move the primary metric – the one that you set out to move.
Often in experimentation, teams don’t move the primary metric, but they find some other positive stories in their data. These become great learnings, but they don’t mean that your experiment was successful and should be switched on. If you switch it on anyway, you risk a negative impact on the metric that you really cared about.
There’s many ways to run an A/B test badly:
- Run a small test for too long which negatively impacts user experience for a small subset of your users
- Run tests misaligned to business objectives
- Run a test with a defined primary and secondary metric, but put it into production based on observed behaviour of users according to another metric
- Run tests with low confidence levels
- Forget about tests and leave them on – forever
Number 3, and variants of the Texas sharpshooter fallacy, are particularly common: ‘We were testing for net bookings, but we saw clicks go up, so we think we should deploy’. The sharpshooter fallacy is where differences in data are ignored and similarities are overemphasised. It’s named after a story about a Texas sharpshooter who fires shots at the side of a barn, then paints the target over the closest cluster of bullet holes.
It’s related to the clustering illusion: the human tendency to create patterns where none actually exist, by interpreting random streaks or clusters in small samples as non-random.
So choose your primary metric wisely.
Your primary metric should be:
- Fully aligned with the outcome you want to achieve. Ideally it would be one of your KRs, or is a metric that could drive one of your KRs.
- One that you can measure. At Booking.com, we were lucky enough to have every metric you could imagine in our experimentation platform. In other organisations, you might be limited to conversion or other high level metrics.
- Sensitive enough to show an impact in a short space of time: a metric that takes 2 years to get a read on is too slow.
- A leading metric. It is far easier to measure a leading metric, than a lagging metric. For example it’s easier to measure people putting items in their basket, than it is to measure revenue. If you really want to drive a lagging metric – you’re best to associate this with a leading metric and then measure improvements to the lagging metric over time.
You can also measure secondary metrics.
You should measure secondary metrics if you’re worried that you might have a negative impact on another part of the experience. So you can measure, for example, bookings (primary metric) and cancellations (secondary metric, to avoid impact on customer experience).
However, look to see if you can factor this into a primary metric. For example: a great primary metric would be net bookings, which takes into account bookings and cancellations.
3. Understanding your experiment controls
Experimentation controls are crucial to ensuring that you’re learning the most you can from your experimentation programme. They also encode the level of risk that you’re willing to take as a business, and gatekeep experiments to make sure you’re doing experimentation well.
The main factors that we cared about in experimentation controls at Booking.com were:
- The confidence level
- Run time
Your confidence level reflects your appetite for risk: the higher it is, the less likely it is that the result you’re seeing is a fluke.
Most companies, including Booking.com, choose 95% as their confidence level. Concretely, this means accepting a 5% chance of declaring a winner when there is no real effect. The confidence level determines the risk of error; while you can pick a lower one, that increases the risk of the test being inaccurate – which lowers the utility of doing it in the first place.
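As a sketch of what significance at a 95% confidence level means in practice, here is a simple two-proportion z-test on made-up conversion counts (real platforms do considerably more, e.g. sequential testing and variance reduction):

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Hypothetical numbers: 2.0% vs 2.3% conversion on 100k users per variant.
z, p = two_proportion_z_test(2000, 100_000, 2300, 100_000)
significant = p < 0.05  # i.e. significant at the 95% confidence level
```

With these illustrative numbers the lift clears the 95% bar comfortably; halve the traffic and the same observed lift might not.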
You need to calculate run time based on experiment audience size, the impact you expect to have on your primary metric, and the confidence interval that you want to achieve. Sometimes this means you see that an experiment isn’t worth running. There are many tools and calculators that will predict how long you need to run your experiment for.
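Such a calculator can be sketched with the standard normal-approximation formula for comparing two conversion rates – a simplification of what a real platform does:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.8):
    """Normal-approximation sample size for detecting a relative lift in a
    conversion rate with a two-sided, two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

def run_time_days(daily_traffic, baseline, relative_lift, variants=2):
    """Days needed to fill every variant, assuming traffic splits evenly."""
    n = sample_size_per_variant(baseline, relative_lift)
    return math.ceil(n * variants / daily_traffic)
```

For a 2% baseline conversion and a 2% relative lift, this needs roughly two million users per variant, which is why low-traffic sites often have to target bigger lifts or more sensitive leading metrics.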
Even if you have a short run time, make sure to mitigate for seasonality. Booking.com runs all experiments for at least 2 weeks. This allows for fluctuations in the market, for example, seasonal holidays.
If your run time is too long, you want to assess whether you can:
- Access a larger sample size
- Choose a different metric as your primary metric
- Choose a tactic that could have a larger impact to your metric
It’s important to generate these calculations before you even start to code, so you can make sure you’re focussing your time on areas that will generate an impact. At Booking.com, only once you assess all of these areas, can you set-up and switch on an experiment.
4. Analysis and iteration
Analysis and iteration are key to extracting full value from experimentation. The whole point of an experimentation programme – and where you really create value – is actually learning from it.
At Booking.com every experiment is analysed for learnings, whether it is positive, negative or inconclusive.
If an experiment is positive, and you understand why, you can leverage your learnings to have even more impact in that space.
If an experiment is negative, it still has an impact on the customer experience. You can assess what didn’t work for the customers, and generate new hypotheses that will allow you to have a positive impact in this space.
If the experiment is inconclusive on your primary metric, you might have influenced some of your secondary levers. The hypothesis could still be valid, but your change might not have been prominent enough. You might have positively impacted one element of the journey, but negatively impacted another.
If you still believe your hypothesis is true, work on another iteration and you’ll be one step closer to achieving a positive result.
A worked example of experimentation at Booking.com
To bring this to life, here’s a real example of an experiment that we ran at Booking.com in the search results space.
The Desired Objective
To make it easy for every customer to choose and book the right stay for them.
Key Result: Net Bookings
Net bookings is one of the key results that can be used to showcase how easy it is for customers to find and book the right stay. The easier it is to find a product that meets your needs, the more likely you are to purchase.
Knowledge and evidence
We had a lot of evidence from past experiments on the search results page.
From these experiments, we knew that filters impacted our key metric: net bookings.
The price filter had one of the highest interaction rates, but it didn’t drive the highest conversion.
From customer research, we saw that customers’ price ranges differed from the options the checkboxes offered. The checkboxes were displayed in increments of £50, and customers often had a budget in between these increments.
Therefore we crafted this hypothesis:
Based on [the current usage of the price filter, and customer research].
We believe [introducing a custom price filter] will encourage [customers that interact with the price filter] to [set their own price range and find a stay that matches their requirements].
We will know this when we see [a 2% increase] happen to [net bookings].
The hypothesis states exactly what we wanted to change and why, for which target audience, and the impact that we expected to see on our key metric.
Choosing the right success metrics
For this hypothesis, you could choose many different primary metrics:
- Increasing interactions with the filter
- Progression from the search results page
- Increased gross bookings
- Decreased time to select a property
However, the flaw in all of these metrics is that they wouldn’t show if customers had chosen the right stay for them. You might make it quicker for them to progress in some parts of the journey, but they might later change their mind as the property wasn’t the best match. Using an end indicator, like net bookings, accounts for this. Therefore, in this scenario, our primary metric was the same as our key result.
You could use these other metrics as secondary metrics, to assess how the customer behaviour has changed.
Understanding your experiment controls
The search results page is the second page users visit on the site, so it receives a lot of traffic. We had a large enough sample size to detect a 2% increase in net bookings within 2 weeks.
The confidence level remained at 95%.
Analysis and iterations
The first version of the experiment was inconclusive.
Customers who interact with filters typically convert at a higher rate than those who don’t, so a change in net bookings could be driven by:
- Increased filter interactions
- Increased conversion for those using the filter
- Decreased cancellations
What we actually saw:
- Filter interactions increased, and each individual customer interacted with the filter more times.
- Conversion for this filter was lower.
- Cancellation rates remained the same.
The combination of increased filter interactions and lower conversions is usually a sign that:
- People do want to use the new feature
- However, the feature isn’t optimally designed, hence conversions are lower
This is a great indicator, as it shows that you’re on the right track, but probably need to iterate.
As people were interacting with the filter multiple times, it was also a sign that they were not easily able to select the price range that they wanted in their first go.
Iterating – our new hypothesis:
Based on [the current usage of the price filter, and customer research].
We believe [optimising the custom price filter, to state the price being selected] will encourage [customers that interact with the price filter] to [set their own price range and find a stay that matches their requirements].
We will know this when we see [a 2% increase] happen to [net bookings].
The hypothesis remains exactly the same. We’re still trying to achieve the same goal. However, the tactic has changed to reflect the new variant that we would run. This new variant now ensures that the price only shifts in increments of £10, and that customers can read the price they’ve selected.
New test iteration
All other factors remained the same.
Analysis and iteration
In this example, the experiment was positive, and therefore rolled out to all customers.
There are also learnings to share with the organisation. For example, the best UX patterns for a slider could be fed into the design guidelines.
The example I’ve run through above is typical of how Booking.com works. But to focus on a single example downplays how experimentation permeates Booking.com’s culture.
5 principles give Booking.com the drive and capability to run experiments with the volume, velocity and impact that it does:
- Tests are aligned to the strategy
- Testing underpins our culture
- We test at volume: big tests, a lot of tests
- We test rigorously
- We’ve built the organisation to support testing
1. Tests are aligned to business strategy
Conversion rate optimisation is the primary purpose of A/B testing. This is only valuable if you are running experiments on the areas of your business that align to strategic impact.
For many years, the growth flywheel at Booking.com was:
- Increase bookings through A/B testing = increased revenue
- Spend increased revenue on paid marketing = increased customer base
- Gain a larger market share to experiment with = increased volume of tests
The majority of tests (9/10) would be deemed as failures: they never make it into features or daily execution. But Booking.com take a similar approach to VCs: the majority of the portfolio will fail, but the 10% which wins will win big enough to pay back the failures.
Everything should be tested
If it can be a test, it should be a test. If it can’t be a test: why do it? Booking.com avoid a ‘we test some things but other things we just know’ culture. They know that everything is an assumption until proven otherwise.
No preferential treatment of HiPPOs:
‘We have evidence that having a C in your title doesn’t make you more successful in deciding what improves your product’ – Stuart Frisby, ex Director of Design at Booking.com
That means that a HiPPO (Highest Paid Person’s Opinion / Highest Paid Person in the Office) has to be submitted to the same testing process as everyone else.
‘Guidelines, not rules’
Anyone at Booking.com can propose a test; management permission is not required. Nor are there restrictions on what can be tested or how it can be tested – this avoids constraining people’s creativity.
Data driven culture, supported by 100% access to data
Everyone in the organisation gets access to as much data as they can in order to be able to inform their thinking. The ask in return is that decisions are informed by (test gathered) data.
Customers shape the product
The product is the result of customer feedback through tests over time, rather than any one person’s conception of the ideal product. That means it can wind up looking like something which very few people would design.
Trust driven by enablement
The entire organisation is enabled and supported to run tests via:
- An in house testing tool where much of the data output is automated
- Onboarding training on experiments
- On the job support from data science
In 2002, Booking.com made a strategic choice to develop their own experimentation platform. This allowed them to control how they wanted to create experiments, and implement the right data and metrics around this. Later the company added an experimentation department.
The experimentation department could balance the science of experimentation, with the risk appetite of the business. As the experimentation programme grew, they could add more metrics, more guardrails, and the ability to process more and more data, increasing trust in the tool.
They could also learn about how teams were using the platform, and either tweak the platform (e.g. to allow for new types of testing, like non-inferiority tests) or educate and train teams on best practice.
“I was at Booking.com for 2 weeks, and I was given an analytics training. I’ve always been very into data, so I thought I had it covered. The person delivering the training was a senior data scientist with an incredible CV who previously worked on a particle accelerator. Needless to say, by slide 3 I was lost!” – Carl Rosseel, ex-Data Analyst at Booking.com
Lots of tests
Volume of tests is an output of a successful culture of testing. Constraining volume of tests increases the risk and frequency of failure, and makes those proposing tests more risk averse.
“When you conduct a large volume of experiments, a low success rate still translates into a significant number of successes, which, in turn, diminish the financial and emotional costs of the failures. If a company does only a handful of experiments a year, it may have only one success or, if it’s unlucky, none. Then failure is a big deal.” – Stefan Thomke, “Building a Culture of Experimentation”, Harvard Business Review
At the same time
Rather than variant A versus variant B for 5% of the total user group, Booking.com run hundreds, sometimes thousands, of atomic tests, all the time, meaning that there are more variants of the Booking.com site in existence than humans that have ever lived.
Another way that this mentality asserts itself is in how they choose to expose users to a test. Many companies, if testing something risky, reduce the size of the sample exposed to the test.
At Booking.com if you’re testing something high risk, you expose it to a lot of people as fast as possible – in order to get a read fast on whether it’s positive or detrimental.
‘We’re going to find out immediately if it’s really bad, rather than letting it be really bad for a few people for a really long time’ – Stuart Frisby, ex Director of Design at Booking.com
Across all platforms, even when it’s hard
Testing on platforms such as mobile apps or the Apple Watch can be unattractive, since you can’t deploy hotfixes or roll back at speed in all scenarios.
Booking.com has made investments in making testing in these environments as feasible as possible – but part of walking the walk is that they live with the limitations of these environments and still test.
Every part of the product
It can be tempting to test a page, versus another page, and declare a winner, based on the data gathered about how the pages performed according to user data on that page.
It’s important to
- Track changes throughout the user journey – changes to key pages can have downstream effects as well as onpage impact.
- Test every part of your product: if you’re maintaining part of the product, you should be testing it
Including cross functional squads to tackle end to end user experience
Marketing, pricing, site experience and the real-life customer experience all feed into user experience.
I worked on a core problem space: off-airport car hire bookings. A lot of our hypotheses were things that we could influence in our area of control (the website). However, the way that the customer reached the site (marketing), the price of these particular trips, and the ease of picking up a car in a non-airport location were also extremely important factors.
We pulled together a cross-functional team to tackle this problem end to end through experimentation.
Hypotheses > Ideas
Ideas are easy, hypotheses create rigour. Hypotheses come with an inherent requirement to frame the concept and prove and disprove it, which ideas do not.
5. Organisational structure
Booking.com have built their organisational structure around their testing culture. It’s similar in mentality to Amazon’s organisational principle of ‘two pizza teams’, whereby every team is an independent, modular entity, which allows Amazon to scale the organisation almost indefinitely.
The goals are that every team is enabled to execute tests within the team, and without interdependencies. Concretely that means that teams are made up of everything they need to execute tests on the product: for example, a typical consumer facing team might include a product owner, a front and a back end engineer, a designer, a copywriter – and so on.
Teams change as hypotheses change
People move teams as the product moves: people move on average every 9-12 months as hypotheses move on which product areas to invest in and grow.
Hiring for A/B testing fit
Booking.com hires for capabilities like entrepreneurialism, business instincts, curiosity, and the ability to learn and familiarise yourself with statistical modelling and data science techniques. Not everyone has to be a data scientist, but the people they hire should be able to learn about some of the key concepts.
Including the managers
As a senior member of the company, this means:
- Your ideas will be tested and you will be wrong
- Your ideas will be challenged by every level of the hierarchy until backed by data
- You should be comfortable with uncertainty: tests will lead you to evolve your strategy and direction
The fun side of this is bets: it’s common for people at Booking.com to place bets on the outcomes of tests, and to eagerly watch tests the way some people would watch a race.
“Instead of making complex plans that are based on a lot of assumptions, you can make constant adjustments with a steering wheel called the Build-Measure-Learn feedback loop.” – Eric Ries, The Lean Startup
Booking.com have infrastructure to test at scale which many companies do not: traffic volume, investments in organisational culture to support experimentation, and an in-house A/B testing platform.
However, that should not put you off from adopting some of the learnings from their culture in a smaller company, and if you are in a larger company, there’s a lot to imitate.
The core components which make Booking.com strong, and which you can adopt if you’re looking to build a testing engine are:
- A rigorous culture of hypotheses, over ideas
- Embracing decentralised approaches to delivery: a democratic approach, versus a feature delivery culture constrained by HiPPOs and rigid road maps
- Believe the tests: go where the tests take you
- Risk & reward: many tests will fail but some will win big
Most of these are about cultural and mindset changes, and don’t require a big traffic base or expensive tooling.
Additionally if you’re in a smaller company, take heart: as Booking.com themselves note, after a decade of testing they have cleaned up much of the low hanging fruit: finding impactful areas gets harder and harder with maturity. At an earlier stage, your wins are likely to be bigger if you adopt this approach.
Why is Booking.com famous for their experimentation culture?
Booking.com’s experimentation approach and culture is credited with their continued dominance of the travel space. In addition their implementation of testing has been so thoroughly embedded at every stage of the organisation that Booking.com now epitomises the A/B testing culture model.
What are the major components of Booking.com’s experimentation culture?
Booking.com’s experimentation culture is a full suite operating model for the business. Employees are hired on the basis of their skillset but also their fit for testing culture. The company’s teams are organised to facilitate tests. Tests occur in every function across the organisation. The company enables tests via an in-house A/B testing platform, onboarding and on the job training, and a top down commitment to tests as a process: everything is tested, regardless of roadmaps, seniority or idea certainty.
What can a company smaller than Booking.com take from Booking.com’s experimentation culture?
The primary learning that other companies can take from Booking.com’s experimentation culture is that experimentation is a mentality and a culture. You can ship products and features without testing but testing gathers far more data, and will inform you that many feature ideas once tested, fail. It might seem expensive up front, but long term it should pay back significantly.