Experimentation Program Maturity: A Learning Guide
What You're About to Understand
After working through this guide, you'll be able to diagnose where any experimentation program sits on its maturity journey — and explain why it's stuck there. You'll spot the difference between a team that runs lots of tests and one that's genuinely mature. You'll know which lever to pull next (culture? process? tools?) and why pulling the wrong one wastes years. And when someone tells you "we need to run more tests," you'll know exactly the right follow-up question to ask.
The One Idea That Unlocks Everything
The Flywheel, not the staircase.
Most people picture experimentation maturity as climbing stairs — Level 1, Level 2, Level 3, onward and upward. That mental model is wrong in a way that causes real damage.
The better image is a flywheel. Picture a massive, heavy wheel. The first push barely moves it. The second push adds a tiny bit of momentum. But each push is easier than the last, and eventually the wheel's own momentum does most of the work.
Here's the flywheel for experimentation: Run tests → Measure their value → Generate interest from stakeholders → Win investment in infrastructure → Lower the cost per test → Run more (and better) tests. Each turn feeds the next. Skip a step — say, you run tests but never communicate their value — and the wheel stops.
This model, developed by Ron Kohavi and Lukas Vermeer from their experience at Microsoft and Booking.com, explains nearly everything about why programs succeed or fail. Programs don't stall because they're on the wrong "level." They stall because one part of the flywheel is broken.
If you remember nothing else: maturity isn't a destination you reach. It's momentum you build — and can lose.
Learning Path
Step 1: The Foundation [Level 1]
Imagine you're a product manager at a mid-size e-commerce company. Your team has an idea: moving the "Add to Cart" button above the fold will increase conversions. Your boss loves the idea. The designer mocks it up. Engineering ships it. Revenue goes... down.
Nobody knows why. Nobody measured it properly. And the team has already moved on to the next idea.
This is what "ad-hoc experimentation" looks like — or more precisely, what the absence of experimentation looks like. The idea was a hypothesis, but nobody treated it as one.
Now picture the alternative. Before shipping, you set up an A/B test: 50% of visitors see the old layout, 50% see the new one. After two weeks with sufficient sample size, the data shows the new layout actually decreased conversions by 3%. You've just saved the company from a bad decision. That "failed" test? It's one of the most valuable things your team did that quarter.
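For readers who want to see the mechanics behind that readout, here is a minimal sketch using a standard two-proportion z-test. Every count below is hypothetical, and deliberately large: a 3% relative drop on a roughly 4% conversion rate is a tiny absolute difference, which is exactly why small effects demand big samples.

```python
# Minimal sketch: did the new layout really convert worse than the old one?
# All visitor and conversion counts are invented for illustration.
from math import sqrt, erf

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided normal tail
    return p_a, p_b, z, p_value

# Hypothetical two-week results: control layout vs. "Add to Cart above the fold".
p_a, p_b, z, p = two_proportion_ztest(conv_a=20_000, n_a=500_000,   # 4.00% baseline
                                      conv_b=19_400, n_b=500_000)   # 3.88%, a ~3% relative drop
print(f"control {p_a:.2%}, variant {p_b:.2%}, z={z:.2f}, p={p:.3f}")
# Output: control 4.00%, variant 3.88%, z=-3.08, p=0.002 -> credible evidence NOT to ship.
```

The decision the test changed (don't ship) is the value, even though the variant "lost."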
This is the core of experimentation maturity: moving from "we ship ideas we believe in" to "we test hypotheses we're uncertain about."
The concept has deep roots. Philip Crosby created the first maturity model in 1979 for manufacturing quality. Carnegie Mellon's SEI adapted it for software in 1991 (the Capability Maturity Model). By the 2010s, as digital A/B testing tools became widespread, practitioners realized that having Optimizely or VWO wasn't enough — organizations needed to mature in how they used these tools.
Today, several frameworks map this progression. The details differ, but they all describe roughly the same arc:
| Stage | What it looks like |
|---|---|
| Reactive/Chaotic | Ad-hoc testing, one champion, no strategy, no documentation |
| Emerging | Small team gaining buy-in, some wins, no formal process |
| Strategic | Experimentation recognized as business strategy, frameworks in place |
| Integrated | Testing culture spreads across teams, leadership actively supports |
| Optimized/Transformative | Experimentation drives all major decisions, industry-leading |
The major frameworks — Speero's four-pillar model, Conversion.com's five-stage model, CXperts' five-level CRO model — all measure maturity across multiple dimensions: strategy & culture, people & skills, process & methodology, and data & tools. The critical insight is that your weakest dimension constrains everything else. A team with world-class tools but no statistical literacy is not mature — it's dangerous.
Check your understanding:
1. Why is a "failed" A/B test (one where the variant lost) potentially more valuable than not testing at all?
2. If a company has excellent experimentation tools but poor culture around testing, what maturity stage are they likely stuck at — and why?
Step 2: The Mechanism [Level 2]
Here's a fact that surprises almost everyone: only about 12% of A/B tests produce a winning result. That's from Optimizely's analysis of 127,000 experiments. Eighty-eight percent of the ideas people test don't work.
This isn't a sign of failure. It's the whole point.
If most ideas worked, you wouldn't need experimentation — you'd just ship everything. The value of a mature experimentation program is that it catches the 88% of bad ideas before they reach your customers. The "save" — preventing a negative outcome — is often worth more than the "win," but almost nobody measures it.
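That prevented loss is easy to approximate, at least roughly. Here is a back-of-envelope sketch of the "save" from the Step 1 example; every figure is a hypothetical assumption, not a benchmark.

```python
# Back-of-envelope value of one prevented bad ship (the Step 1 example).
# All figures are hypothetical assumptions for illustration only.
annual_revenue_through_flow = 50_000_000   # assumed revenue flowing through the tested checkout path
relative_conversion_drop    = 0.03         # the 3% drop the A/B test caught before launch
months_until_noticed        = 6            # assumed time a silent regression would go undetected

avoided_loss = annual_revenue_through_flow * relative_conversion_drop * (months_until_noticed / 12)
print(f"Rough value of that 'failed' test: ${avoided_loss:,.0f}")   # about $750,000
```

Numbers like these are crude, but they turn the invisible "save" into something a stakeholder can weigh against the program's visible costs.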
Key Insight: The most mature organizations have the highest failure rates because they test bolder, more uncertain hypotheses. Immature programs only test "safe" ideas (button color changes) that have higher win rates but lower impact. This is the testing paradox — looking good on win rate is actually a sign of timidity.
How the flywheel works in practice — a worked example:
Microsoft developed this model across 24+ product teams running 20,000+ experiments per year.
Turn 1 — First push: A team selects a test where stakeholders disagree. (This is deliberate — counterintuitive results generate the most organizational interest.) They keep it simple: one treatment, one control, 50/50 split (a minimal assignment sketch follows Turn 5). The result surprises people. Now they're paying attention.
Turn 2: The team measures and communicates the value of that test — both the win and, crucially, the decision it changed. Interest spreads. Another team wants to try.
Turn 3: Growing interest justifies investment in better infrastructure — maybe an internal platform, maybe a commercial tool. Setup time drops from weeks to days.
Turn 4: Lower cost per test means more teams can participate. The hypothesis pipeline grows. Test velocity increases without sacrificing quality.
Turn 5: The cycle repeats, faster each time.
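Behind Turn 1's "one treatment, one control, 50/50 split" there is usually nothing more exotic than deterministic bucketing. A minimal sketch, assuming a hash-based scheme and an invented experiment name rather than any specific platform's implementation:

```python
# Minimal sketch of deterministic 50/50 assignment for a single A/B test.
# The experiment name and hashing scheme are assumptions, not a real platform's design.
import hashlib

def assign(user_id: str, experiment: str = "add_to_cart_above_fold") -> str:
    """Hash experiment + user so assignment is stable across visits and
    uncorrelated across different experiments (each experiment acts as its own salt)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100          # a stable bucket from 0 to 99
    return "treatment" if bucket < 50 else "control"

print(assign("user-12345"))   # the same user always lands in the same arm
print(assign("user-67890"))
```

Keeping assignment deterministic and salted per experiment is what later lets many tests run concurrently without their populations becoming correlated by construction.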
The organizational mechanism — Centers of Excellence:
How maturity physically spreads through an organization follows a predictable pattern:
- Centralized CoE (early stages): One expert team runs all tests. This builds capability but creates a bottleneck.
- Hub-and-spoke/federated (mid-maturity): The central team provides tools, training, and standards. Individual teams ("spokes") execute their own tests.
- Decentralized with guardrails (highest maturity): Self-service experimentation with automated quality checks. At Booking.com, any employee can launch an experiment without management permission.
The CoE's mission should be enablement — teaching teams to fish, not running all tests centrally. This is the organizational flywheel: as more teams gain capability, the central team shifts from doing to coaching to governing.
The Booking.com story illustrates the full mechanism:
It started in 2004 when a single engineer attended a Ron Kohavi talk and realized experimentation could "settle constant, time-sucking arguments." Today, 75% of Booking.com's 1,800 technology/product staff actively use the experimentation platform. They run 1,000+ concurrent experiments at any given moment. Tests deploy across 75 countries and 43 languages in under an hour.
But here's what most people miss: Booking.com didn't get there by removing governance. They invested massively in automated guardrails — Sample Ratio Mismatch detection, pre-test checks, monitoring pipelines, safeguards for end-to-end ownership. The guardrails are the governance. This is the democratization paradox: the most mature programs (anyone can test) and the least mature programs (nobody controls anything) look similar from the outside but are fundamentally different. One has invisible guardrails; the other has nothing.
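One of those guardrails, Sample Ratio Mismatch detection, is simple enough to sketch. A minimal version of the standard chi-square check follows; the example counts and the 0.001 alert threshold are assumptions, not Booking.com's actual settings.

```python
# Minimal Sample Ratio Mismatch (SRM) check: did the observed traffic split
# drift from the configured split by more than chance allows?
from math import erfc, sqrt

def srm_p_value(n_control: int, n_treatment: int, expected_ratio: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) against the planned split."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 degree of freedom the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))

p = srm_p_value(50_600, 49_400)   # a "50/50" test that drifted about 1.2% off balance
if p < 0.001:                     # assumed alert threshold
    print(f"SRM detected (p={p:.1e}): do not read the metrics, investigate assignment first")
else:
    print(f"No SRM detected (p={p:.3f})")
```

A check like this can run automatically on every experiment, which is exactly how guardrails scale where human review cannot.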
Check your understanding:
1. Microsoft deliberately selects tests where stakeholders disagree. Why is this a strategic choice for building the flywheel, rather than just an interesting observation?
2. Explain the democratization paradox: why does Booking.com's "anyone can test" model require more investment in governance, not less?
Step 3: The Hard Parts [Level 3]
Everything you've learned so far has a clean logic to it. Flywheel spins, maturity grows, everyone wins. Now for the parts where the model breaks.
The Velocity Trap
Manuel da Costa's research at Efestra found that when programs exceed 30 tests per developer per year, expected impact drops by 87%. That's not a typo. The program that doubled its testing velocity may have nearly eliminated its impact.
Why? Because velocity that exceeds an organization's capacity to generate good hypotheses turns experimentation into a numbers game. The hypothesis pipeline shifts from research-driven to backlog-filling. Garbage in, garbage out. And because test volume is easy to measure and report, it becomes a Goodhart's Law victim: the metric becomes the target, and the actual goal (better decisions) gets lost.
This challenges a fundamental assumption of many maturity models: that more tests equals more mature. Da Costa found an e-commerce company running 200+ experiments/year with 15+ specialists where only 15% of product decisions were guided by reliable experimentation data. Activity is not capability.
The Interaction Problem at Scale
Stefan Thomke gives a vivid example: changing font color to blue tests at +1%. Changing background to blue tests at +1%. Ship both changes together? The result is negative — the text becomes unreadable. When Booking.com runs 1,000+ concurrent experiments, these interaction effects become a massive concern. Current detection methods are computationally expensive and imperfect. This makes program-level ROI calculation fundamentally uncertain — cumulative impact is non-additive.
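The non-additivity is easy to write down. A minimal 2×2 factorial sketch, with conversion rates invented to mirror Thomke's story:

```python
# Interaction effects: a 2x2 factorial view of the font/background example.
# Conversion rates are invented: each change helps alone, together they hurt.
baseline       = 0.0400   # neither change
blue_font_only = 0.0404   # +1% relative
blue_bg_only   = 0.0404   # +1% relative
both_changes   = 0.0380   # unreadable blue-on-blue text: worse than baseline

additive_prediction = baseline + (blue_font_only - baseline) + (blue_bg_only - baseline)
interaction = both_changes - additive_prediction   # the part two separate tests cannot see

print(f"Additive prediction: {additive_prediction:.4f}")   # 0.0408
print(f"Observed together:   {both_changes:.4f}")          # 0.0380
print(f"Interaction term:    {interaction:+.4f}")          # negative: effects don't add
```

Estimating that interaction term requires deliberately testing the combination (or running a factorial design); with 1,000+ overlapping experiments the number of combinations explodes, which is why cumulative program impact can't be read off by summing individual wins.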
The Implementation Gap
A company demonstrates a clear winner in testing but can't get it into production. The winning variant enters the implementation queue. By the time engineering ships it weeks later, circumstances have changed enough to require re-testing. Experimentation capability without deployment capability is wasted. This is a systems problem that no experimentation maturity model adequately addresses.
The Hidden Cost of Winning
Research from Kellogg/Northwestern reveals something deeply counterintuitive: even successful experiments introduce complexity into organizations, making it harder to run future experiments. More features mean more interactions, which make future changes harder to test. Successful experimentation programs face an entropy tax that compounds over time. Lyft's 2020 reinforcement learning algorithm tested positively on all metrics but created long-term complications nobody anticipated.
The Cultural Regression Problem
Maturity isn't permanent. Organizations can lose experimentation capability when leadership changes, budgets are cut, or priorities shift. The word "maturity" implies a destination; the reality is closer to a dynamic equilibrium. No existing model adequately addresses how to make maturity resilient to organizational shocks. The biological metaphor of "fitness" may be more honest than developmental "maturity."
Key Insight: The hardest insight in the entire field is this — experimentation maturity is ultimately about organizational epistemology. It means moving from "we know what works" to "we have hypotheses we can test." This threatens the fundamental basis on which organizational power is built: expertise, experience, authority. The leaders who must champion this change are the ones whose authority is most threatened by it.
Check your understanding:
1. A VP of Product tells you "we doubled our test velocity this quarter." Is this good news or bad news? What question would you ask to find out?
2. Why might a highly successful experimentation program actually become harder to run over time, even with growing organizational support?
The Mental Models Worth Keeping
1. The Flywheel
Maturity builds iteratively through value-investment cycles, not through one-time transformations. Each turn (run → measure → interest → invest → lower cost) makes the next turn easier. Use it when: you're diagnosing why a program stalled — find the broken link in the cycle.
2. The Velocity Trap
More tests can mean less impact. The shift from counting tests to counting decisions improved is the hallmark of genuine advancement. Use it when: someone equates program health with test volume.
3. The Weakest-Link Constraint
Organizations occupy different maturity positions across different dimensions simultaneously. You can be Level 4 in tools and Level 1 in culture — and the Level 1 constrains everything. Use it when: deciding where to invest next — always shore up the weakest dimension.
4. The Democratization Paradox
The most mature programs and the least mature programs both look like "anyone can test." The difference is invisible automated guardrails. Use it when: evaluating whether an organization's openness to testing reflects sophistication or chaos.
5. The Insurance Framing
The real value of experimentation is often the prevented losses (the 88% of bad ideas you didn't ship), not the wins. But organizations almost never measure or communicate this defensive value. Use it when: making the business case for experimentation to skeptical leadership.
What Most People Get Wrong
1. "More tests = more mature"
- Why people believe it: It's intuitive — more practice should mean better performance. Maturity models that emphasize velocity reinforce this.
- What's actually true: Beyond ~30 tests per developer per year, impact drops by 87%. Mature programs run better tests, not necessarily more. The metric is decisions improved, not experiments completed.
- How to tell the difference: Ask whether the team can articulate what they learned from their last 10 tests. If they can't, velocity is outrunning learning.
2. "A 12% win rate means our program is failing"
- Why people believe it: In most contexts, an 88% failure rate sounds terrible.
- What's actually true: 12% is the expected base rate across 127,000 experiments. The value isn't the 12% of ideas that work — it's the 88% of bad ideas that never reached production.
- How to tell the difference: Calculate the cost of shipping those 88% of losing ideas without testing. That's the program's "save" value.
3. "Experimentation replaces intuition"
- Why people believe it: "Data-driven" culture messaging implies data should make all decisions.
- What's actually true: Thomke calls this Myth #1. Intuition generates hypotheses; experiments validate or refute them. The two are symbiotic, not competitive. Mature organizations are "data-informed," not "data-driven."
- How to tell the difference: Check whether the team's experiment backlog comes from research and insight, or from random brainstorming. The former uses intuition well; the latter wastes experimental capacity.
4. "We just need the right tool"
- Why people believe it: Tools are tangible, purchasable, and easy to evaluate. Culture change is hard and slow.
- What's actually true: Tools are the easiest dimension of maturity. Culture, process, and skills matter more and take longer. Companies bought Optimizely or VWO and expected experimentation to work — then discovered the tool was the least of their problems.
- How to tell the difference: If a team has great tools but no documented hypotheses, no QA process, and no knowledge base, the tool isn't the bottleneck.
5. "We don't have enough traffic to experiment"
- Why people believe it: Sample size calculators show daunting numbers for detecting small effects.
- What's actually true: Sample size depends on expected effect magnitude. Larger effects need smaller samples. Specialized statistical methods (sequential testing, Bayesian approaches) can partially offset traffic limitations.
- How to tell the difference: Ask what effect size they're trying to detect. If it's a 1% change, yes, they need massive traffic. If it's a 20% change from a major redesign, far less traffic suffices; the power calculation sketched below makes the gap concrete.
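That gap can be checked with a standard two-proportion sample size formula. A minimal sketch, assuming a 4% baseline conversion rate, two-sided alpha of 0.05, and 80% power:

```python
# Per-arm sample size needed to detect a relative lift in a conversion rate.
# The 4% baseline, alpha = 0.05 (two-sided), and power = 0.80 are assumptions.
from math import ceil, sqrt

Z_ALPHA_2 = 1.96   # two-sided 5% significance
Z_BETA    = 0.84   # 80% power

def n_per_arm(baseline: float, relative_lift: float) -> int:
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (Z_ALPHA_2 * sqrt(2 * p_bar * (1 - p_bar))
                 + Z_BETA * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

print(n_per_arm(0.04, 0.01))   # detect a 1% relative lift: roughly 3.8 million visitors per arm
print(n_per_arm(0.04, 0.20))   # detect a 20% relative lift: roughly 10,000 visitors per arm
```

Same site, same baseline, a few hundred times less traffic needed, purely because the effect being sought is larger.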
The 5 Whys — Root Causes Worth Knowing
Chain 1: "88% of experiments don't win"
Claim → Most human-generated ideas are wrong about what improves outcomes → Because cognitive biases (availability, confirmation, anchoring) systematically distort hypothesis quality → Organizations compound individual bias through groupthink and HiPPO (highest-paid person's opinion) dynamics → The 12% win rate is actually the expected base rate for navigating a high-dimensional solution space → Root insight: Experimentation is the only reliable way to navigate complex systems because analytical solutions don't exist for emergent behavior.
- Level 2 deep: Customer behavior emerges from the interaction of psychology, context, and competitive alternatives — too many non-stationary variables to model analytically.
- Level 3 deep: Experimentation sidesteps modeling entirely by measuring the actual system response in real-time.
Chain 2: "Culture is the hardest dimension to change"
Claim → Culture involves deeply held beliefs across many people → People resist changing behaviors that have historically been rewarded → Existing incentive structures reward certainty and punish failure → Markets reward predictable outcomes; experimentation introduces variance → Root insight: Most organizational structures are built for exploitation (optimizing known approaches), not exploration (testing uncertain hypotheses). This is a structural inheritance from industrial-era management science.
- Level 2 deep: Management science was built on planning, forecasting, and control — all assuming the future is predictable.
- Level 3 deep: Updating this requires distributed authority and tolerance for failure, which threatens the hierarchical power structures that leaders depend on. The people with authority to change the system are most invested in its current form.
Chain 3: "Only 1 in 10 companies reach transformative maturity"
Claim → Transformative maturity requires alignment across all dimensions simultaneously → Each dimension creates dependencies on others; the weakest link constrains everything → Investment in one dimension often means neglecting another → Early investments take years to pay off; leadership turnover disrupts timelines → Root insight: The payoff of transformative maturity is largely invisible (prevented mistakes, better decisions) while the cost is visible (headcount, tools, time) — creating chronic under-investment.
- Level 2 deep: Counterfactual value ("what would have happened without this experiment") is inherently unmeasurable.
- Level 3 deep: This is a fundamental epistemological problem, not just a measurement problem. Causal inference can estimate average treatment effects but cannot reconstruct specific counterfactual histories.
Chain 4: "58% of companies lack a prioritization framework"
Claim → Most companies rely on ad-hoc brainstorming for experiment ideas → Prioritization requires explicit criteria that force trade-offs, which creates conflict → Conflict avoidance is rewarded in most corporate cultures → Without prioritization, politically powerful stakeholders' ideas get tested regardless of expected value → Root insight: The experimentation program itself is not being run experimentally — a meta-irony where the program meant to remove HiPPO decision-making reproduces it.
- Level 2 deep: Meta-optimization (optimizing the optimizer) requires systems thinking that most organizations haven't developed.
- Level 3 deep: Education, incentives, and career paths are all organized around functional specialization. Seeing the program as a system requires bridging strategy, statistics, technology, and organizational behavior.
The Numbers That Matter
| Number | What it means |
|---|---|
| 12% average win rate (Optimizely, 127K tests) | Nearly 9 in 10 ideas don't work. This is the strongest argument for experimentation, not against it — it means most ideas you'd ship without testing would hurt. |
| 54% of companies at strategic/transformative level (2025, up from 35% in 2021) | The industry is on a roughly 5-year maturation curve. But only 1 in 10 reach the transformative tier. |
| 87% impact drop beyond 30 tests/developer/year | There's an inflection point where velocity kills quality. To put that in perspective: doubling your testing speed could cut your results by seven-eighths. |
| 1,000+ concurrent experiments at Booking.com | That's like running a different experiment for every feature on the site, simultaneously. Possible only because per-experiment marginal cost is near zero. |
| 58% without a prioritization framework | More than half of experimentation programs are choosing what to test more or less at random. That's like having a laboratory with no research agenda. |
| 52% with no QA process for experiments | Half of all programs launch experiments without checking them first — meaning they could be measuring the wrong thing and never know. |
| 6% SRM prevalence | 1 in 17 experiments produces untrustworthy results from Sample Ratio Mismatch alone. Without detection, these bad results look perfectly valid. |
| 5-20 people, 2-4 years to build an in-house platform | That's a small company's worth of engineers. Only organizations with $250B+ market cap typically justify this investment. GrowthBook is ~5x cheaper than Optimizely for equivalent functionality. |
| $600M → $3.4B AI testing automation market (2023-2033) | 19% CAGR signals massive investment — but only 30% currently describe AI as "highly effective" for testing. The promise outpaces the reality, for now. |
| 75% of startups use A/B testing (Duke/Harvard, 13,935 startups) | Experimentation isn't niche anymore. Three-quarters of startups are doing it. The question is no longer whether to test but how well. |
Where Smart People Disagree
1. Velocity vs. Quality — Is experimentation a numbers game or a precision game?
- Pro-velocity (Optimizely, GrowthBook): With a 12% win rate, more shots on goal mathematically increases the probability of significant discoveries. The flywheel model lowers cost per test so volume doesn't sacrifice quality.
- Pro-quality (Manuel da Costa/Efestra): Beyond 30 tests/dev/year, impact drops 87%. Measure decisions improved, not tests run. Activity is not capability.
- Unresolved because: The optimal testing rate is almost certainly context-dependent — varying by organization size, traffic, industry, and hypothesis-generation capacity. Nobody has established a general model for finding it.
2. Democratization vs. Governance — Should experts control testing?
- Pro-democratization (Vermeer/Booking.com): Any employee should be able to test. Automated guardrails replace human gatekeeping. This scales better than expert bottlenecks.
- Pro-governance (Traditional CRO agencies): Expert oversight prevents bad experiments, statistical misuse, and metric gaming. Automated guardrails can't capture every edge case.
- Unresolved because: Whether machines can substitute for human statistical judgment depends on how sophisticated the guardrails are — and building sophisticated guardrails requires the kind of expertise the governance camp advocates for.
3. Single Metric (OEC) vs. Multi-Metric Evaluation
- Pro-OEC (Kohavi): A single Overall Evaluation Criterion forces explicit trade-offs and organizational alignment. Amazon's OEC for email included lifetime unsubscribe cost: OEC = (Revenue - Unsubscribes × Lifetime_Loss) ÷ Users (a small worked example follows this debate).
- Against: Critics argue this oversimplifies multi-stakeholder trade-offs, creates gaming incentives, and can't adequately capture long-term effects.
- Unresolved because: How to weight short-term vs. long-term factors in a single metric is itself an unsolved problem. The OEC requires assumptions about future value that are inherently uncertain.
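To make the formula in the Pro-OEC bullet concrete, here is a small worked illustration. Every number is invented; the point is only how a single criterion trades short-term revenue against the lifetime cost of unsubscribes.

```python
# Worked illustration of the email OEC quoted above:
#   OEC = (Revenue - Unsubscribes * Lifetime_Loss) / Users
# All figures are invented for illustration.
def email_oec(revenue: float, unsubscribes: int, lifetime_loss: float, users: int) -> float:
    return (revenue - unsubscribes * lifetime_loss) / users

# An aggressive campaign: more short-term revenue, many more unsubscribes.
aggressive = email_oec(revenue=120_000, unsubscribes=4_000, lifetime_loss=50, users=1_000_000)
# A restrained campaign: less revenue, far fewer unsubscribes.
restrained = email_oec(revenue=100_000, unsubscribes=500, lifetime_loss=50, users=1_000_000)

print(f"aggressive OEC per user: {aggressive:+.4f}")   # (120,000 - 200,000) / 1M = -0.08
print(f"restrained OEC per user: {restrained:+.4f}")   # (100,000 -  25,000) / 1M = +0.075
```

On revenue alone the aggressive campaign "wins"; once the unsubscribe cost enters the criterion it loses, which is exactly the trade-off the OEC is meant to force into the open.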
4. Are Maturity Models Genuine Diagnostics or Marketing Tools?
- Pro: Frameworks like Speero's provide genuine benchmarking and actionable roadmaps. Companies that use them progress faster.
- Against: With 20+ competing models and no convergence, each agency creates its own to sell assessments. The proliferation suggests commercial incentives outweigh scientific rigor.
- The meta-irony: The field of experimentation maturity models is itself immature — lacking the kind of evidence-based evaluation that experimentation is supposed to provide.
5. Ethics of Testing Without Consent
- Pro-testing (Thomke): "The real jeopardy is NOT experimenting." UI changes are normal business practice. Users benefit from better experiences.
- Against (Academic ethicists): Back-end algorithm changes that manipulate user behavior cross ethical lines. Dark patterns may have been A/B-tested into existence. A 2023 Springer article argues the ethical dimension has been "neglected."
- Legal frontier: Whether dark patterns specifically optimized via A/B testing should face stricter consumer protection scrutiny is an open regulatory question.
What You Don't Know Yet (And That's OK)
After absorbing this material, you understand the conceptual architecture of experimentation maturity, the major frameworks, the flywheel mechanism, and the key debates. Here's where your knowledge runs out:
- Optimal testing velocity for a given organization size and type remains an open question. The 30 tests/developer/year threshold is a single data point, not a law.
- Counterfactual valuation — how to quantify the value of decisions you didn't make because experiments prevented them — is a fundamental epistemological problem that the field hasn't solved.
- Interaction effects at scale — when 1,000+ experiments run concurrently, detecting and managing interference between them is computationally expensive and theoretically incomplete.
- Cultural regression — how to make experimentation maturity resilient when leadership changes, budgets are cut, or priorities shift — has no proven solution.
- Low-traffic environments — most maturity models were built for high-traffic digital products. How they apply to B2B SaaS, physical products, healthcare, or government is largely uncharted.
- Whether AI can genuinely generate good experiment hypotheses, or only optimize the execution of human-generated ones, is still an early-stage and uncertain question.
- The right success metric for an experimentation program — decisions improved? Revenue attributed? Prevented losses? Learning velocity? — is still debated with no consensus.
Subtopics to Explore Next
1. Statistical Foundations of A/B Testing
Why it's worth it: Without understanding p-values, confidence intervals, power analysis, and sequential testing, you can't evaluate whether an experiment's results actually mean what they claim.
Start with: Ron Kohavi's "Trustworthy Online Controlled Experiments" (2020), chapters on statistical methodology.
Estimated depth: Medium (half day)
2. The Overall Evaluation Criterion (OEC) and Metric Design
Why it's worth it: How you define success determines what your experiments optimize for — including whether they accidentally optimize for dark patterns.
Start with: Kohavi's LinkedIn article on OEC design; Amazon's email OEC as a worked example.
Estimated depth: Medium (half day)
3. Organizational Change Management for Experimentation
Why it's worth it: Culture is the hardest and most impactful dimension. Understanding change management frameworks (Kotter, ADKAR) applied to experimentation unlocks the "how" of cultural transformation.
Start with: Thomke's HBR article "Building a Culture of Experimentation" (March 2020).
Estimated depth: Deep (multi-day)
4. Feature Flagging and Experiment Infrastructure
Why it's worth it: The technical plumbing — feature flags, progressive rollouts, assignment logic — determines what's possible to test. Understanding it reveals why some organizations can iterate in hours while others take months.
Start with: DevCycle or LaunchDarkly documentation on feature flag architecture; GrowthBook's open-source platform docs.
Estimated depth: Medium (half day)
5. Causal Inference Beyond Randomized Experiments
Why it's worth it: Not everything can be A/B tested. Synthetic controls, regression discontinuity, and difference-in-differences extend experimentation into offline and physical contexts.
Start with: Search "causal inference methods for business" — Scott Cunningham's "Causal Inference: The Mixtape" is an accessible starting point.
Estimated depth: Deep (multi-day)
6. The Build vs. Buy Decision for Experimentation Platforms
Why it's worth it: This decision commits years of engineering time or locks you into a vendor. Understanding the trade-offs (GrowthBook at ~5x cheaper than Optimizely, open-source vs. enterprise) prevents costly mistakes.
Start with: Eppo's "Build vs. Buy Decision Framework" blog post; GrowthBook's comparison pages.
Estimated depth: Surface (1-2 hours)
7. Experimentation Ethics and Dark Patterns
Why it's worth it: As experimentation scales, the line between optimization and manipulation gets blurry. Understanding the ethical frameworks prepares you for regulatory changes that are coming.
Start with: "The Ethics of Online Controlled Experiments" (Springer, 2023).
Estimated depth: Medium (half day)
8. Bayesian vs. Frequentist Approaches in Business Experimentation
Why it's worth it: The statistical paradigm you choose affects how you make decisions under uncertainty — including when to stop tests early and how to handle multiple comparisons.
Start with: Search "Bayesian A/B testing explained" — VWO and Dynamic Yield have accessible introductions.
Estimated depth: Deep (multi-day)
Key Takeaways
- The flywheel beats the roadmap. Maturity isn't built by following a step-by-step plan — it's built by creating self-reinforcing cycles where each success funds the next investment.
- Your weakest dimension is your real maturity level. World-class tools plus poor culture equals poor experimentation. Always invest in the weakest link.
- Failure is the expected default, not a sign of dysfunction. An 88% "failure" rate means the program is working — it's catching bad ideas before they ship.
- Volume without hypothesis quality is negative value. Past ~30 tests per developer per year, impact can collapse by 87%. Speed that outpaces thinking is destructive.
- The most valuable experiments are the ones that prevent bad decisions. But nobody celebrates the disaster that didn't happen, which chronically undervalues experimentation.
- Democratization without guardrails is chaos; guardrails without democratization is a bottleneck. You need both, and the guardrails should be automated.
- Counterintuitive results are the most strategically valuable. They demonstrate experimentation's unique value and generate the organizational interest that spins the flywheel.
- Culture change is harder than any technical problem because it threatens the authority structures of the people who must champion it.
- The experimentation program should itself be run experimentally. The 58% of companies without a prioritization framework are reproducing the HiPPO problem they're trying to solve.
- Organizations can regress. Maturity isn't permanent. Leadership changes, budget cuts, and priority shifts can unwind years of progress. Build resilience, not just capability.
- The short-term measurement bias is structural, not accidental. Experimentation optimizes for what it measures, and what it measures is biased toward the measurable — which is almost always short-term.
- One person at one conference started Booking.com's experimentation culture. The right insight, landing on fertile organizational soil at a moment of pain, can catalyze extraordinary change.
- The "just need the right tool" belief has wasted more experimentation budgets than any other misconception. Tools are the easiest and least important dimension.
- Successful experiments create an entropy tax. More features mean more interactions, which make future testing harder. This is the hidden cost of winning.
Sources Used in This Research
Primary Research:
- Kohavi, Tang, Xu — Trustworthy Online Controlled Experiments (Cambridge, 2020)
- Thomke — Experimentation Works (HBS Press, 2020)
- Thomke — "Building a Culture of Experimentation" (HBR, March 2020)
- Vermeer et al. — "Democratizing Online Controlled Experiments at Booking.com" (ResearchGate, 2017)
- Microsoft Research / Kohavi / Vermeer — "It Takes a Flywheel to Fly" (2021)
- Speero — Experimentation Maturity Program Reports (2023, 2025)
- VWO — Experimentation Program Maturity Report (2024)
- Springer / Minds and Machines — "The Ethics of Online Controlled Experiments" (2023)
- Kellogg/Northwestern — "The Hidden Cost of Successful Experiments" (2020+)
- Duke/Harvard — Study of 13,935 startups using A/B testing (2019)
Expert Commentary:
- Manuel da Costa / Efestra — "What Really Is Experimentation Maturity?" (2025); "The Experimentation Maturity Myths" (LinkedIn, 2024)
- GrowthBook — "9 Common Pitfalls That Can Sink Your Experimentation Program" (2024)
- Martijn Scheijbeler — "20 Reasons Why Most Experiment Programs Fail" (2023)
- Thomke — "Seven Myths of Business Experimentation" (Strategy+Business, 2020)
- Kohavi — "The Overall Evaluation Criterion (OEC)" (LinkedIn)
- Vermeer — "It Takes a Flywheel to Fly" (Booking Product / Medium, 2021)
- Optimizely — "Scaling Experimentation Program's Metrics in 2025"
- Eppo — "Build vs. Buy Decision Framework" (2024)
- AB Tasty — "How to Prevent Knowledge Turnover" (2023)
- Natasha Wahid — "Changing Organizational Culture for Experimentation" (2020)
Good Journalism:
- Convert.com — "A/B Testing & CRO Stats" (2024)
- DevCycle — "Adopt the 10,000 Experiment Rule" (2024)
Reference:
- Speero — Experimentation Program Maturity Audit
- Conversion.com — The Conversion Maturity Model; "From Quick Wins to Cultural Shifts"
- CXperts — "The 5 Levels of CRO"
- Fresh Egg — CRO Maturity Model
- Kameleoon — "What is a Center of Excellence in Experimentation?"
- Wikipedia — Quality Management Maturity Grid; Capability Maturity Model
- Kohavi — Experiment Guide (experimentguide.com)
- Koalatative — "3 Most Common Experimentation Maturity Buckets" (2024)
- UMSL — "History & Origin of CMMI" (2013)
- Statsig — "Experimentation ROI: Proving Platform Value" (2024)