Metric Selection, Guardrail Metrics & OEC: A Learning Guide

What You're About to Understand

After working through this, you'll be able to design a metric framework for an A/B test that won't steer you wrong -- choosing what to optimize, what to protect, and what to watch. You'll spot the difference between a metric that looks informative and one that actually is. And when someone proposes shipping an experiment because "engagement went up," you'll know exactly which follow-up questions to ask before that decision goes sideways.

The One Idea That Unlocks Everything

Think of an experiment's metrics like flying an aircraft. Your OEC (Overall Evaluation Criterion) is the heading -- the single composite direction you're trying to go. Your guardrail metrics are the engine temperature, fuel level, and altitude warnings -- you're not trying to improve them, but if any one of them crosses a red line, you abort the manoeuvre no matter how good the heading looks. Your diagnostic metrics are the individual instrument readouts that help you understand why something is happening. And your data quality metrics are the checks that your instruments are even working.

No pilot would fly with just a heading and no warning lights. No experimenter should ship with just a success metric and no guardrails.

Learning Path

Step 1: The Foundation [Level 1]

Here's a concrete scenario. Amazon sends promotional emails. They want more revenue per email. But every email risks an unsubscribe -- and an unsubscribed user has a calculable lifetime value loss. So Amazon's experiment metric isn't just "revenue." It's:

OEC = (Revenue - Unsubscribes x Lifetime_Loss) / Number_of_Users

That single formula encodes a trade-off: short-term money vs. long-term relationship. This is the OEC in action -- a composite metric that forces the organisation to confront trade-offs once, in the formula, rather than re-arguing them in every experiment review meeting.
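The trade-off is easiest to see with numbers. Here's a minimal sketch of the formula above; every figure is an illustrative assumption, not Amazon's actual data:

```python
# Hedged sketch of the email OEC described above.
# All numbers below are illustrative assumptions.

def email_oec(revenue, unsubscribes, lifetime_loss, n_users):
    """OEC = (Revenue - Unsubscribes * Lifetime_Loss) / Number_of_Users."""
    return (revenue - unsubscribes * lifetime_loss) / n_users

# Treatment earns more revenue but triggers twice the unsubscribes.
control = email_oec(revenue=10_000, unsubscribes=20, lifetime_loss=100, n_users=50_000)
treatment = email_oec(revenue=11_000, unsubscribes=40, lifetime_loss=100, n_users=50_000)
# control   -> (10000 - 2000) / 50000 = 0.16
# treatment -> (11000 - 4000) / 50000 = 0.14
```

The treatment "wins" on raw revenue but loses on the composite, which is exactly the argument the formula settles in advance.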

Every metric in an experiment belongs to one of five categories:

  1. OEC / Success Metrics -- What you're trying to improve. Tested with superiority tests ("is treatment better than control?").
  2. Guardrail Metrics -- What you're protecting from harm. Tested with non-inferiority tests ("can we prove treatment isn't meaningfully worse?"). Revenue, page load speed, and bounce rate are common examples.
  3. Driver Metrics -- Leading indicators that move faster than the OEC. More actionable, more sensitive, shorter-term. They predict the OEC.
  4. Diagnostic Metrics -- Help explain why the OEC moved. Never use them for ship/no-ship decisions.
  5. Data Quality Metrics -- Validate the experiment itself is trustworthy. Sample Ratio Mismatch (SRM) checks are the most important example.

The origin story matters: Ronny Kohavi brought the OEC concept from industrial quality engineering (Taguchi's Design of Experiments) into web experimentation in a landmark 2007 KDD paper. Before OEC, teams at companies like Microsoft used ad hoc, inconsistent metrics. Different teams optimised different things. Decisions contradicted each other. The OEC was invented to end that chaos.

Check your understanding:
- Why can't you just pick "revenue" as your single experiment metric and call it a day?
- What's the difference between a guardrail metric and a secondary success metric -- and why does that difference matter statistically?

Step 2: The Mechanism [Level 2]

Now let's go deeper into how each piece works.

Building an OEC -- the three hard problems:

First, normalisation. Your OEC components have different units -- revenue in dollars, retention as percentages, latency in milliseconds. Without normalising to a common scale (0-1 or 0-100), the highest-variance metric will dominate the composite. A $500 outlier purchase would swamp a 2% retention improvement.

Second, weighting. This is the hardest part. The weights encode your organisation's values. Get them wrong and you optimise the wrong thing at scale, thousands of experiments marching confidently in the wrong direction. There's no formula for setting weights -- it's a strategic conversation.

Third, directionality. Every metric must unambiguously signal "good" or "bad" when it moves. "Time on site" is the classic trap: more time could mean engaged users or confused users. A metric with ambiguous directionality is worse than no metric at all because it gives you false confidence.
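The first two problems can be sketched mechanically; the third shows up as the sign flip on the latency component. All ranges and weights below are illustrative assumptions -- in practice the weights are a strategic decision, not a formula:

```python
# Minimal sketch: min-max normalise each OEC component to [0, 1],
# fix directionality, then combine with weights.
# Ranges and weights are illustrative assumptions.

def minmax(value, lo, hi):
    """Scale a raw metric onto [0, 1] given its expected range."""
    return (value - lo) / (hi - lo)

components = {
    "revenue_per_user": minmax(4.20, lo=0.0, hi=10.0),  # dollars
    "retention":        minmax(0.62, lo=0.0, hi=1.0),   # fraction
    "latency":          minmax(300, lo=0, hi=2000),     # ms, lower is better
}
# Flip directionality so "higher is better" holds for every component.
components["latency"] = 1.0 - components["latency"]

weights = {"revenue_per_user": 0.5, "retention": 0.4, "latency": 0.1}
oec = sum(weights[k] * components[k] for k in components)
```

Without the `minmax` step, the dollar component's variance would dominate the composite; without the directionality flip, faster pages would lower the OEC.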

Key Insight: The Deng & Shi (2016) framework decomposes metric quality into two properties: directionality (does movement reliably indicate good or bad?) and sensitivity (can the metric detect real treatment effects in your experiment timeframe?). A metric that lacks either is useless for experimentation.

How guardrails actually work -- a worked example:

Airbnb runs thousands of experiments per month. Each has guardrail metrics (revenue, bounce rate, page load speed, bookings). They set a non-inferiority margin (NIM) for each -- the maximum acceptable degradation. If a guardrail degrades beyond the NIM, the experiment gets flagged. In practice: ~25 experiments get flagged monthly, ~5 get paused. That's the guardrail system doing its job -- catching the 0.5% of experiments that would cause real harm at scale.

The two statistical approaches to guardrails are fundamentally different:

Inferiority testing asks: "Is there evidence the treatment is harmful?" If no evidence, ship. This is "innocent until proven guilty." The risk: underpowered tests will almost never find evidence of harm, so you default to shipping even when harm exists.

Non-inferiority testing asks: "Can we prove the treatment isn't worse by more than our tolerance?" If you can't prove safety, don't ship. This is "guilty until proven innocent." It requires larger sample sizes and choosing an explicit NIM, but it's the more rigorous standard.

Key Insight: The choice between inferiority and non-inferiority testing isn't a statistical question -- it's a values question. It answers: "When uncertain, do we default to shipping or to protecting users?" Spotify recommends starting with inferiority testing and graduating to non-inferiority as teams mature.
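The two decision rules can be sketched side by side using a normal approximation for the difference in means. The effect size, standard error, and margin below are illustrative assumptions:

```python
from statistics import NormalDist

def guardrail_check(diff, se, nim, alpha=0.05):
    """diff = treatment - control (higher is better), se = its standard
    error, nim = non-inferiority margin (a negative number: the maximum
    tolerated harm). Illustrative sketch, not a production test."""
    z = NormalDist().inv_cdf(1 - alpha)
    # Inferiority ("innocent until proven guilty"): block only if the
    # one-sided upper bound sits below zero -- proven harm.
    proven_harmful = diff + z * se < 0
    # Non-inferiority ("guilty until proven innocent"): ship only if the
    # one-sided lower bound clears the margin -- proven "not meaningfully worse".
    proven_safe = diff - z * se > nim
    return {"inferiority_blocks": proven_harmful,
            "non_inferiority_ships": proven_safe}

# A small, noisy negative effect with a 1-point margin: inferiority
# testing waves it through; non-inferiority testing does not.
result = guardrail_check(diff=-0.004, se=0.005, nim=-0.01)
```

Here `result` comes back with `inferiority_blocks` false (no proof of harm, so ship) and `non_inferiority_ships` false (no proof of safety, so don't) -- the same data, opposite decisions, which is the values question in code form.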

Making metrics more sensitive:

CUPED (Controlled-experiment Using Pre-Experiment Data) is perhaps the most powerful practical technique in modern experimentation. It uses each user's pre-experiment behaviour as a covariate: Y_adjusted = Y - theta * (X - E[X]). Because most metric variance comes from stable between-user differences (heavy users vs. light users), removing that predictable component achieves ~50% variance reduction at Bing. That effectively halves required sample sizes or experiment durations.
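A minimal CUPED sketch on simulated data. The simulation parameters are assumptions chosen to mimic stable between-user differences; the ~50% figure above is Bing's result, not a guarantee:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate: stable per-user engagement (pre-period X) carries over
# into the experiment-period metric Y. Parameters are assumptions.
n = 10_000
x = rng.normal(100, 30, n)                  # pre-experiment metric per user
y = 0.8 * x + rng.normal(0, 10, n) + 2.0    # in-experiment metric

# CUPED adjustment: Y_adj = Y - theta * (X - mean(X)),
# with theta chosen to minimise Var(Y_adj): theta = Cov(X, Y) / Var(X).
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_adj = y - theta * (x - x.mean())

reduction = 1 - y_adj.var() / y.var()   # fraction of variance removed
```

Note the mean of `y_adj` equals the mean of `y` -- CUPED changes variance, not the estimated effect -- which is why it's safe to apply to both arms of an experiment.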

Capping/winsorisation truncates extreme outlier values (typically at 1-5% tails). This is counterintuitive: removing data increases statistical power. Extreme outliers add so much variance that capping dramatically improves signal-to-noise. The critical caveat: don't cap if "whale" users legitimately drive your revenue -- you'd be throwing away the signal you care about.
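A winsorisation sketch on a simulated heavy-tailed revenue distribution (the distribution parameters and the 99th-percentile cap are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Heavy-tailed revenue: most users spend little, a few spend a lot.
# Distribution parameters are illustrative assumptions.
revenue = rng.lognormal(mean=1.0, sigma=2.0, size=50_000)

# Winsorise (cap) at the 99th percentile rather than dropping rows:
# extreme values are clipped, not removed, so sample size is unchanged.
cap = np.percentile(revenue, 99)
capped = np.clip(revenue, None, cap)

# Variance drops far more than the mean does -- that's the power gain.
var_reduction = 1 - capped.var() / revenue.var()
```

The same caveat from the paragraph above applies in code: if the clipped tail is your whale users, `capped` is now measuring the wrong thing.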

Check your understanding:
- An experiment shows CTR decreased by 5%, but absolute clicks increased by 10%. What happened, and what's the right decision?
- Why does CUPED work so much better in digital products than in traditional clinical trials?

Step 3: The Hard Parts [Level 3]

Surrogate metrics and the time horizon trap. Netflix's true north is long-term subscriber retention. But that takes months to measure. So they use surrogates -- watch time, "grab attention within 90 seconds." Surrogates have a structural problem: they tend to have high false-positive rates. The prediction model from short-term signal to long-term outcome is an abstraction layer that can be gamed or misaligned. And validating a surrogate requires waiting for the long-term outcome -- but by then, the product and user base have changed, so the surrogate may already be stale. Active research is exploring Pareto-optimal proxy metrics that best balance bias and sensitivity, but this remains an unsolved problem.

Goodhart's Law accelerates with experimentation power. Here's the structural irony: the better your experimentation platform, the faster you optimise, the more pressure you place on your metrics, and the faster those metrics cease to be good measures. Manheim & Garrabrant identified four failure modes:
- Regressive: Your proxy ignores important unmeasured factors
- Extremal: The metric-outcome relationship breaks at extreme values
- Causal: You mistake correlation for causation
- Adversarial: Someone deliberately games the metric

Adding guardrails doesn't fully solve this. Optimising "improve OEC while not breaking guardrails" is still a lossy compression. Teams can degrade unmeasured quality dimensions while technically satisfying all measured constraints.

Heterogeneous treatment effects hide under averages. A single OEC number can mask segment-level harm. Netflix found algorithm changes that increased game-playing but decreased movie-watching -- the OEC average hid opposite effects. Personalised treatment policies (shipping different variants to different segments) can outperform the global "winner," but designing a segment-aware OEC without losing interpretability is an open research question.

The dirty secret of experimentation at scale: most experiments fail. At Microsoft, only about one-third of experiments improve their target metric. In mature, optimised domains like Bing search, the success rate is even lower. This isn't a sign of bad experimenters -- it's the base rate of product ideas. The experiments that teach the most are the ones that contradict expectations, but humans are psychologically biased to dismiss exactly those results.

Check your understanding:
- If your experimentation platform can now run 10x more experiments per quarter, why might that actually increase the risk of Goodhart's Law failures?
- An experiment improves the OEC in every user segment individually, but the overall OEC decreases. Is this possible, and what's the mechanism?

The Mental Models Worth Keeping

1. The Aircraft Instrument Panel
Metrics aren't a single number -- they're a structured dashboard. OEC is your heading, guardrails are your warning lights, diagnostics are your instruments, data quality is your instrument check. No single metric tells the whole story, and ignoring any category is flying blind.
Use it when: Deciding how many and which metrics to track for an experiment.

2. Goodhart's Ratchet
The more you optimise a metric, the faster it decouples from the thing it's supposed to measure. Experimentation power itself is a risk factor. Every metric has a shelf life under optimisation pressure.
Use it when: A metric has been your primary OEC for more than a year without re-validation. Time to check if it still predicts what you think it predicts.

3. Directionality Before Sensitivity
A metric that moves easily in the wrong direction is worse than one that barely moves. Always validate that a metric has unambiguous directionality before worrying about whether it's sensitive enough to detect effects.
Use it when: Evaluating a proposed metric -- ask "if this goes up, am I sure that's good?" before asking "can I detect a 2% change?"

4. The Denominator Trap
Ratio metrics (CTR, conversion rate) compress two dimensions into one. A change in the ratio could be driven by the numerator, the denominator, or both. Always decompose ratios into their components.
Use it when: Any experiment result reports a change in a percentage or rate metric. Always ask: "what happened to the numerator AND the denominator?"

5. Innocent vs. Guilty Until Proven (Inferiority vs. Non-Inferiority)
The default assumption for guardrails encodes your organisation's risk tolerance. "Ship unless proven harmful" (inferiority testing) is a very different posture from "don't ship until proven safe" (non-inferiority testing). Most teams default to the less rigorous standard without realising they've made a values choice.
Use it when: Setting up guardrail testing -- make the choice of default explicit.

What Most People Get Wrong

1. "Guardrail metrics are just secondary metrics"
Why people believe it: Both are "other metrics besides the primary one." They sit next to each other on dashboards.
What's actually true: Guardrails require a fundamentally different statistical test (non-inferiority, not superiority) and different decision logic. A guardrail that passes a superiority test but fails a non-inferiority test should block shipping.
How to tell the difference: Ask "am I trying to improve this metric or protect it?" If protect, it's a guardrail with different statistical requirements.

2. "If a metric looks surprisingly good, celebrate"
Why people believe it: Big positive results feel like validation. Celebrating wins is natural.
What's actually true: Twyman's Law -- "any figure that looks interesting or different is usually wrong." Surprisingly good metric movements are more often data quality issues or bugs than real treatment effects.
How to tell the difference: Check SRM first. Then check telemetry. Then check for novelty effects. Only after all those pass should you start believing a surprisingly large effect.

3. "The OEC should be the same as the North Star metric"
Why people believe it: Both are "the most important metric." Seems redundant to have two.
What's actually true: The North Star is a long-term strategic metric (Netflix: long-term retention). The OEC is a short-term proxy designed to be sensitive enough for 1-2 week experiments. They're related but serve different purposes on different timescales.
How to tell the difference: If your "OEC" takes more than two weeks to show significant movement in a well-powered experiment, it's a North Star, not an OEC. You need a more sensitive proxy.

4. "More metrics give better insight"
Why people believe it: More data is usually better. Instrumenting everything feels thorough.
What's actually true: Without the success/guardrail/diagnostic taxonomy, more metrics create noise, false positives (with 100 metrics and alpha=0.05, you'll get ~5 false positives by chance), and decision paralysis.
How to tell the difference: For every metric, ask "what decision would change if this metric moved?" If no decision changes, it's not earning its place.

5. "Removing outliers always hurts statistical power"
Why people believe it: Throwing away data seems wasteful. More data = more power is the usual intuition.
What's actually true: Capping extreme outliers increases sensitivity by reducing variance more than it reduces signal. Heavy-tailed distributions (revenue, session length) benefit enormously from winsorisation.
How to tell the difference: Run a sensitivity analysis with and without capping. The capped version will typically show smaller confidence intervals.

The 5 Whys -- Root Causes Worth Knowing

Why does the OEC exist?
Experiments affect multiple metrics simultaneously -> Teams can't make unambiguous ship/no-ship decisions -> Without a single criterion, people cherry-pick the metric supporting their preferred outcome -> Confirmation bias is hardwired -> Making trade-offs explicit ONCE at the org level beats making them implicitly in every meeting -> Root insight: Organisational alignment on "what good means" is the prerequisite for any data-driven culture. Without it, data becomes a weapon for politics rather than a tool for learning.
Level 2: Alignment is hard because different teams have different local objectives. A single metric necessarily compresses these, losing information each team cares about.
Level 3: Without a shared metric, inter-team trade-offs become political (loudest voice wins) rather than principled.

Why did Bing's "queries per user" OEC fail?
More queries per user meant users couldn't find answers, not that they were engaged -> The metric had ambiguous directionality -> Teams assumed queries = engagement without validating the causal model -> Validation requires expensive long-term holdouts -> Organisations default to easy-to-move metrics because experimentation timelines create pressure -> Root insight: Performance evaluation cycles (quarterly) are shorter than metric validation cycles (years). No individual manager bears the cost of a bad metric -- the cost is distributed over time. A tragedy of the temporal commons.
Level 2: The pressure to show results in 1-2 week experiments favours fast-moving, sensitive metrics over valid ones.
Level 3: This organisational pressure persists even when leadership knows it's problematic, because the incentive structure rewards speed over correctness.

Why does Goodhart's Law accelerate with experimentation?
Teams find ways to improve the metric without improving the underlying thing -> Every metric is a proxy, and the proxy-target correlation breaks under optimisation pressure -> Optimisation finds the cheapest path, which is often gaming -> The space of "interventions that improve the metric" is far larger than "interventions that improve the metric AND the underlying thing" -> Root insight: Metrics are dimensionality reductions of complex reality. Any compression loses information that becomes exploitable under pressure. More experimentation power = faster exploitation.
Level 2: Experimentation increases optimisation power, which increases pressure on the metric, which accelerates breakdown.
Level 3: Adding more guardrails is itself a form of Goodhart's Law -- the system can game "improve OEC while not breaking guardrails" by degrading unmeasured dimensions.

The Numbers That Matter

~33% of experiments improve their target metric (Microsoft). Two-thirds of ideas don't work. If your win rate is much higher, you're probably not being ambitious enough -- or your metrics are too easy to game.

~25 guardrail triggers per month at Airbnb, ~5 paused. Out of thousands of experiments. This is the base rate of "experiments that would cause real harm." It's low -- but 5 harmful experiments per month shipped at scale would be devastating.

CUPED achieves ~50% variance reduction (Bing). This effectively doubles your experimental throughput. To put that in perspective: an experiment that would need 4 weeks to reach significance without CUPED needs only 2 weeks with it.

6-10% of experiments have SRM issues (Microsoft: 6%, LinkedIn: 10%). That's roughly 1 in 10 to 1 in 17 experiments with corrupted randomisation. If you're not checking for SRM, you're trusting results from broken experiments.

A 1% SRM can introduce 2%+ metric bias. The bias is NOT proportional -- it's worse. The "missing" users aren't random; they're systematically different (slow connections, specific devices), so their absence shifts averages disproportionately.
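The SRM check itself is just a chi-square goodness-of-fit test on the observed split. A stdlib-only sketch -- the traffic counts are illustrative, and the stringent 0.001 alert threshold in the test is a common convention, not a universal standard:

```python
from math import sqrt
from statistics import NormalDist

def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test (1 df) against the configured
    split. A tiny p-value means randomisation or logging is broken,
    so the experiment's results shouldn't be trusted."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # With 1 degree of freedom, chi2 = z^2, so the two-sided p-value
    # is 2 * (1 - Phi(sqrt(chi2))).
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))

# 50.2% / 49.8% on a million users looks harmless to the eye,
# but it is a wildly significant mismatch against a 50/50 design.
p = srm_pvalue(502_000, 498_000)
```

This is the point of the "1% SRM" number above: splits that look cosmetically fine at scale are statistically impossible under correct randomisation.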

Target: 80% statistical power before launch. That means a 20% chance of missing a real effect even in a properly designed experiment. Run below 80% power and you're essentially flipping coins.
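The 80% target translates directly into sample-size requirements. A sketch of the standard normal-approximation formula for a two-proportion test; the baseline rate and lift are illustrative assumptions:

```python
from statistics import NormalDist

def sample_size_per_arm(baseline, mde_rel, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided test on a
    conversion rate, via the textbook normal-approximation formula.
    Illustrative sketch, not a power-analysis library."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    # n per arm ~ (z_a + z_b)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    n = ((z_alpha + z_beta) ** 2
         * (p1 * (1 - p1) + p2 * (1 - p2))
         / (p2 - p1) ** 2)
    return int(n) + 1

# Detecting a 2% relative lift on a 10% baseline at 80% power needs
# hundreds of thousands of users per arm -- the "most real effects
# are 1-5%" point made concrete.
n = sample_size_per_arm(baseline=0.10, mde_rel=0.02)
```

This is also why CUPED's variance reduction matters so much: halving the variance halves this number.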

Bonferroni with 100 metrics x 10 experiments = alpha of 0.00005 per test. This is why naive multiple testing correction kills your ability to detect anything. FDR (Benjamini-Hochberg) maintains usable power by controlling the proportion of false discoveries rather than the probability of any false discovery.
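The contrast can be sketched directly. Benjamini-Hochberg is a step-up procedure over the sorted p-values; the p-values below are illustrative:

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of discoveries under BH FDR control at level q.
    Step-up rule: find the largest k with p_(k) <= (k / m) * q and
    declare the k smallest p-values discoveries."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Ten metric p-values. Bonferroni at 0.05 / 10 = 0.005 keeps only the
# first; BH also keeps the next three small ones.
pvals = [0.001, 0.008, 0.012, 0.020, 0.30, 0.40, 0.55, 0.60, 0.75, 0.90]
discoveries = benjamini_hochberg(pvals)
```

With these inputs BH declares the four smallest p-values discoveries, which is the sense in which it "maintains usable power" relative to Bonferroni's single survivor.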

Netflix: "Grab attention within 90 seconds." A decade of optimisation focused on this single metric. That's the power of a well-chosen OEC -- and the risk, if it ever decouples from actual user value.

Most real treatment effects are 1-5% relative change. That's why sensitivity matters so much. In mature products, the easy wins are gone. The remaining improvements are small and require precise measurement to detect.

Where Smart People Disagree

Single OEC vs. Multi-Metric Framework
Kohavi argues a single composite OEC forces trade-offs, aligns organisations, and prevents cherry-picking. Critics argue it hides trade-offs in opaque weights, creates false precision, and inevitably gets Goodharted. Spotify's compromise -- a structured multi-metric framework with typed metrics and specific tests for each type -- has gained traction but hasn't ended the debate. Unresolved because the right answer likely depends on organisational maturity and size.

Inferiority vs. Non-Inferiority Testing for Guardrails
Inferiority testing is easier, requires less sample, and is a reasonable starting point. Non-inferiority testing is more rigorous but demands choosing an explicit NIM (how much harm will you tolerate?), which forces uncomfortable value conversations. Spotify's pragmatic position: start with inferiority, graduate to non-inferiority. The deeper tension is that non-inferiority requires organisations to explicitly quantify acceptable harm -- a value judgment most resist making public.

Micro-Conversion vs. Macro-Conversion as Primary Metric
"The next user action" is more sensitive and actionable. "The end business goal" is what actually matters. No consensus because the right choice depends on experiment type, business maturity, and team capability. Feature-level metrics (button clicks) are the worst of all worlds -- any new button will get some clicks, proving nothing about product value.

Whether OEC Can Be Truly Validated
Optimists point to long-term holdouts and backtesting. Skeptics note the environment changes too fast -- by the time you could validate, the product and market are different. Pragmatists argue you can validate the OEC is "not obviously wrong" (invalidation is possible even if full validation isn't). This is a genuine epistemological limit, not just a practical one.

Multiple Testing Correction for Guardrails
Conservatives correct across all metrics. Practitioners argue guardrails should be exempt from correction because you want maximum sensitivity to harm. FDR advocates control the false discovery rate rather than family-wise error rate. The debate centres on whether the cost of a false alarm on a guardrail is symmetric with a false alarm on a success metric (it isn't -- blocking a good treatment is much less costly than shipping a harmful one).

What You Don't Know Yet (And That's OK)

Metrics for AI-native products. How do you design an OEC for LLM-based products where sessions per user, page views, and clicks don't apply? A single LLM session might accomplish what previously took 10 search queries. The entire OEC framework may need reinvention. This is an active frontier with no established answers.

Automated Goodhart detection. Can we build systems that automatically detect when a metric has decoupled from its underlying goal? This would be enormously valuable but remains an open research problem.

Inter-experiment interactions. When thousands of experiments run simultaneously on shared metrics, cumulative impacts can emerge that no single experiment triggers. How to detect and manage these interaction effects is a systems problem, not just a statistical one.

Ethical guardrails. How should fairness, equity, and privacy be formally incorporated as guardrails with quantitative thresholds? This sits at the intersection of statistics, ethics, and policy -- and nobody has a clean answer.

Zero-to-one metric design. When you're building something genuinely new, there's no historical data to calibrate metric sensitivity or directionality. How do you choose metrics for products that don't yet exist?

Subtopics to Explore Next

1. CUPED and Variance Reduction Techniques
Why it's worth it: Doubles your effective experimental throughput -- the single highest-leverage technical improvement for any experimentation program.
Start with: Deng et al. 2013 paper "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data"
Estimated depth: Medium (half day)

2. The Dirty Dozen -- Metric Interpretation Pitfalls (Dmitriev et al. 2017)
Why it's worth it: A checklist of 12 specific ways metric interpretation goes wrong, each with real examples from Microsoft. Immediately applicable to any experiment review.
Start with: The original paper or Adrian Colyer's summary at blog.acolyer.org
Estimated depth: Medium (half day)

3. Non-Inferiority Testing -- Theory and Practice
Why it's worth it: Unlocks rigorous guardrail testing. Most teams use the wrong statistical test for guardrails and don't realise it.
Start with: Spotify's "Risk-Aware Product Decisions in A/B Tests with Multiple Metrics" (2024)
Estimated depth: Medium (half day)

4. Goodhart's Law -- The Four Types (Manheim & Garrabrant)
Why it's worth it: Gives you a taxonomy for diagnosing how a metric is failing, which tells you what to fix.
Start with: Search "Categorizing Variants of Goodhart's Law" by Manheim & Garrabrant
Estimated depth: Surface (1-2 hours)

5. Surrogate/Proxy Metric Validation
Why it's worth it: Connects short-term experimentation to long-term business outcomes -- the central unsolved problem in metric selection.
Start with: Duan, Ba et al. (2021) -- "Online Experimentation with Surrogate Metrics: Guidelines and a Case Study"
Estimated depth: Deep (multi-day)

6. Multiple Testing Correction (Bonferroni, FDR, and Beyond)
Why it's worth it: Understanding when and how to correct for multiple comparisons is essential once you're tracking more than a handful of metrics.
Start with: Search "Benjamini-Hochberg procedure explained" for intuition, then read the Spotify paper for applied context
Estimated depth: Medium (half day)

7. SRM Detection and Root Cause Analysis
Why it's worth it: SRM is the single most useful data quality check in experimentation. Knowing how to detect and diagnose it prevents acting on corrupted results.
Start with: Search "Sample Ratio Mismatch in A/B Testing" -- Microsoft ExP team has extensive writing
Estimated depth: Surface (1-2 hours)

8. Heterogeneous Treatment Effects and Personalised Experimentation
Why it's worth it: The frontier of experimentation -- moving from "does this work on average?" to "for whom does this work, and how much?"
Start with: Search "conditional average treatment effect experimentation" or "CATE estimation"
Estimated depth: Deep (multi-day)

Sources Used in This Research

Primary Research:
- Kohavi, Henne, Sommerfield (2007) -- "Practical Guide to Controlled Experiments on the Web" (KDD)
- Deng & Shi (2016) -- "Data-Driven Metric Development for Online Controlled Experiments" (KDD)
- Dmitriev, Gupta, Kim, Vaz (2017) -- "Twelve Common Metric Interpretation Pitfalls"
- Deng, Xu, Kohavi, Walker (2013) -- "Improving the Sensitivity of Online Controlled Experiments" (CUPED paper)
- Kohavi, Tang & Xu (2020) -- Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing (Cambridge University Press)
- Spotify Engineering (2024) -- "Risk-Aware Product Decisions in A/B Tests with Multiple Metrics"
- Duan, Ba et al. (2021) -- "Online Experimentation with Surrogate Metrics" (arXiv)
- Various (2023) -- "Pareto Optimal Proxy Metrics" (arXiv)
- Various (2023) -- "Statistical Challenges in Online Controlled Experiments: A Review" (The American Statistician)

Expert Commentary:
- Spotify/Confidence -- "Better Product Decisions with Guardrail Metrics"
- Tatiana Xifara / Airbnb (2021) -- "Designing Experimentation Guardrails"
- Jon Noronha -- "The Perils of Experimenting with the Wrong Metrics"
- Adrian Colyer (2017) -- Summary of the Dirty Dozen paper
- Microsoft Research / ExP -- "Beyond Power Analysis: Metric Sensitivity" and "Deep Dive Into Variance Reduction"
- Netflix Tech Blog -- "Reimagining Experimentation Analysis at Netflix"
- Statsig -- "Decoding Metrics and Experimentation with Ron Kohavi"
- VWO -- "Three Kinds of Metrics: Success, Guardrail, Diagnostic"

Good Journalism:
- GrowthBook -- "Goodhart's Law and the Dangers of Metric Selection"
- Dave Redfern -- "What is the Overall Evaluation Criterion (OEC)"

Reference:
- Eppo -- "What are Guardrail Metrics? With Examples"
- PostHog -- "Guardrail metrics for A/B tests, explained"
- Analytics Toolkit -- "What is an Overall Evaluation Criterion?"
- Statsig -- "Picking Metrics 101"
- Mixpanel -- "Guardrail metrics: The complete guide"