Long-Run Effect Estimation: A Learning Guide
Holdout Groups, Switchback Testing, and the Gap Between What You Measure and What's Real
What You're About to Understand
After working through this guide, you will be able to explain why summing your A/B test wins almost certainly overstates your product's real improvement -- and design the right method to find out by how much. You'll know when to reach for a holdout group versus a switchback experiment versus a surrogate index, and you'll be able to spot the failure modes that make each technique give wrong answers. Most practically, you'll be able to walk into a meeting where someone presents "cumulative experiment impact" and ask the one question that separates honest measurement from wishful accounting.
The One Idea That Unlocks Everything
Think of your experimentation program as a mutual fund, and the holdout as the audit.
Each A/B test is like buying a stock. The analyst says it went up. Great. You buy fifty stocks over a year, each one reportedly a winner. But when the auditor checks the total portfolio value at year-end, it's less than the sum of the reported gains. Why? Some gains were illusory (winner's curse -- the stock was up the day you looked). Some stocks cannibalized each other (two streaming features fighting for the same user attention). Some early gains faded (novelty wore off). The holdout group is the auditor who checks the actual portfolio balance against the alternative of never investing at all.
Now here's the twist: some investments can't be measured by a simple holdout because the market itself reacts to your trades. When your actions change the system for everyone -- like a pricing change in a ride-hailing marketplace -- you need a different instrument. That's where switchback experiments come in: instead of splitting people, you split time, giving the whole market one treatment and then switching it back.
If you remember only this: individual experiment results are self-reported gains; long-run estimation methods are the independent audit.
Learning Path
Step 1: The Foundation [Level 1]
The Problem No One Talks About
Imagine you run an experimentation program at a streaming service. Over a quarter, you ship 15 features. Each one showed a statistically significant lift in its A/B test: +0.3% retention here, +1.2% engagement there. You sum them up and proudly report: "Our experimentation program delivered +8% engagement this quarter."
Then someone runs a holdout -- a small group of users (say 5%) who never received any of those 15 features -- and compares them to everyone else. The actual cumulative lift? Maybe +4%. Maybe +2%.
This isn't a bug. It's a predictable consequence of how experimentation works. Three forces drive the gap:
- Winner's curse: When you declare a test a "winner" based on its observed lift, you've selected for results that include positive noise. The true effect is almost always smaller than the observed effect. This is provable selection bias -- not bad practice, just math.
- Novelty effects: Users poke at new features out of curiosity. That initial burst of engagement fades. Your 2-week A/B test captured the honeymoon, not the marriage.
- Cannibalization: Feature A gets users to watch more trailers. Feature B gets users to browse more recommendations. Both "work" individually. But users have finite time -- Feature A and Feature B are competing for the same 45-minute evening session. 2 + 2 = 3, not 4.
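To make the first force concrete, here is a minimal simulation of how selecting "winners" and then summing their observed lifts overstates the true cumulative effect -- even before novelty decay and cannibalization enter the picture. All effect sizes and noise levels are invented for illustration:

```python
import numpy as np

# Minimal simulation of the winner's curse in a portfolio of experiments.
# All effect sizes and noise levels are invented for illustration.
rng = np.random.default_rng(0)

n_experiments = 200
true_lift = rng.normal(0.000, 0.004, n_experiments)   # most features do ~nothing
noise_se = 0.005                                       # per-experiment standard error
observed_lift = true_lift + rng.normal(0, noise_se, n_experiments)

# Ship only the "winners": observed lift clears a significance-style threshold.
winners = observed_lift > 1.96 * noise_se

reported_total = observed_lift[winners].sum()   # what summing the test reports claims
actual_total = true_lift[winners].sum()         # what a clean holdout would measure,
                                                # before novelty decay or cannibalization

print(f"shipped features:          {winners.sum()}")
print(f"sum of reported lifts:     {reported_total:.3f}")
print(f"sum of true lifts:         {actual_total:.3f}")
print(f"holdout-to-reported ratio: {actual_total / reported_total:.0%}")
```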
What a Holdout Actually Is
A holdout (or holdback) is simple in concept: take 1-5% of your users, exclude them from all new feature launches for 3-6 months, then compare their metrics to everyone else. The holdout group stays on the "old" product. Everyone else gets the evolving product. The gap between them is your experimentation program's actual cumulative impact.
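The readout itself is just a two-group comparison. Here is a minimal sketch on simulated data (the metric, group sizes, and effect size are illustrative assumptions, not any company's numbers):

```python
import numpy as np

# Minimal sketch of the holdout readout: compare a metric between the holdout
# (frozen product) and the treated population. Data here is simulated.
rng = np.random.default_rng(1)

holdout = rng.normal(loc=100.0, scale=30.0, size=50_000)    # 5% holdout
treated = rng.normal(loc=103.0, scale=30.0, size=950_000)   # everyone else

lift = treated.mean() - holdout.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size + holdout.var(ddof=1) / holdout.size)

print(f"cumulative lift: {lift:.2f} +/- {1.96 * se:.2f} (95% CI)")
# Note: the standard error is dominated by the smaller (holdout) group --
# which is why a 1% holdout buys far less precision than it might seem.
```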
What a Switchback Experiment Is
A switchback flips the logic of an A/B test. Instead of splitting users (half get treatment, half get control), everyone is in the same condition at any given moment -- and you alternate that condition over time. DoorDash, for example, tests surge pricing by giving an entire region surge pricing for 30 minutes, then switching it off for 30 minutes, then on again, randomly.
Why? Because in a marketplace, splitting users doesn't work. If you show surge pricing to half the riders in Manhattan, those riders order fewer deliveries, freeing up couriers for the control riders. You've contaminated your control group. Switchback avoids this by ensuring that at any given moment, everyone is in the same condition.
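A minimal sketch of what that assignment looks like in practice: the randomization unit is the (region, time-block) pair rather than the user, so everyone in a market shares one condition at any moment. The region names and the 30-minute block length below are illustrative assumptions, not any company's actual configuration:

```python
import numpy as np

# Sketch of switchback assignment: randomize over (region, time-block) units
# instead of over users.
rng = np.random.default_rng(2)

regions = ["manhattan", "brooklyn", "queens"]
blocks_per_day = 48  # 30-minute blocks

# Each (region, block) independently gets treatment or control.
schedule = {
    region: rng.integers(0, 2, size=blocks_per_day)  # 1 = treatment, 0 = control
    for region in regions
}

# At any moment, *everyone* in a region shares the same condition, so riders
# and couriers within that market can't contaminate each other's assignment.
for region, assignment in schedule.items():
    print(region, "".join("T" if a else "C" for a in assignment[:16]), "...")
```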
Check your understanding:
- Why does summing individual A/B test lifts overestimate cumulative impact? Name at least two distinct mechanisms.
- In one sentence, what is the core difference between when you'd use a holdout versus a switchback?
Step 2: The Mechanism [Level 2]
How Holdouts Diagnose the Audit Gap
The holdout works because it provides a single, clean, unbiased comparison that captures winner's curse, novelty decay, and cannibalization simultaneously. You don't have to untangle which force caused how much of the gap -- the holdout just tells you the net reality.
Worked example -- Disney Streaming: Disney ran a 7-month universal holdout pilot. They shipped multiple engagement-boosting features during the period. Individually, each feature showed positive lifts. But the holdout revealed that engagement-driving impacts of one feature often partially cannibalized the impacts of another. The cumulative effect was substantially less than the sum of parts. Disney settled on 3-month holdout periods as their standard.
Eppo (an experimentation platform) offers a calibration benchmark: a holdout showing 80% of summed individual impacts "should be cause for celebration." Results at 20% suggest systemic problems -- your experimentation process is producing more noise than signal.
Key Insight: The holdout doesn't just measure impact. It measures the trustworthiness of your entire measurement apparatus. It's a meta-measurement.
How Switchback Handles Interference
The formal term for the contamination problem is a SUTVA violation: the Stable Unit Treatment Value Assumption requires that each user's outcome depends only on their own treatment. Marketplaces violate this spectacularly -- a pricing algorithm shown to drivers affects outcomes for riders, and vice versa.
Switchback resolves this by making the unit of randomization time rather than users. Within each time block, everyone gets the same treatment, so there's no within-period contamination. The treatment effect is estimated by comparing outcomes across treatment periods versus control periods.
Worked example -- DoorDash: DoorDash tests dispatch and pricing algorithms using 30-minute switchback blocks across geographic regions. But there's a critical detail: they discard the first ~7 minutes of each block. Why? Carryover effects. When the system switches from treatment to control, residual effects linger -- drivers are still repositioned from the previous period, user sentiment carries over. This "burn-in" or "washout" period lets the system settle.
The fundamental tradeoff: shorter blocks give you more data points (lower variance) but more carryover contamination (higher bias). Longer blocks give cleaner periods but fewer comparisons. Under geometric mixing assumptions (Hu & Wager, 2022), the optimal block length can be computed analytically. In practice, DoorDash found that standard error estimation must use cluster-robust methods -- ignoring the correlation structure within region-time units produces false positives.
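A hedged sketch of that analysis pipeline -- drop the burn-in minutes, then estimate with standard errors clustered on the region-time block -- using simulated data and statsmodels. The block length, burn-in length, and data-generating process are assumptions for illustration, not DoorDash's actual system:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated minute-level switchback data with a ramp-up after each switch.
rng = np.random.default_rng(3)
n_blocks, block_minutes, burn_in = 200, 30, 7

rows = []
for b in range(n_blocks):
    treated = rng.integers(0, 2)
    block_shock = rng.normal(0, 1.0)          # shared noise within a block
    for minute in range(block_minutes):
        # effect ramps up over the first few minutes of each treated block
        # (a crude stand-in for carryover / settling dynamics)
        ramp = min(minute / burn_in, 1.0)
        y = 10 + 2.0 * treated * ramp + block_shock + rng.normal(0, 1.0)
        rows.append({"block": b, "minute": minute, "treated": treated, "y": y})

df = pd.DataFrame(rows)
analysis = df[df["minute"] >= burn_in]        # discard the washout period

X = sm.add_constant(analysis["treated"].astype(float))
fit = sm.OLS(analysis["y"], X).fit(
    cov_type="cluster", cov_kwds={"groups": analysis["block"]}
)
print(f"effect: {fit.params['treated']:.2f}  "
      f"(cluster-robust SE {fit.bse['treated']:.2f})")
# Naive (non-clustered) standard errors would be far too small here, because
# minute-level observations within a block are correlated.
```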
The Winner's Curse: A Deeper Look
Netflix analyzed 123 historical A/B tests and built a Bayesian hierarchical model to correct for winner's curse. Their finding: a corrected decision rule would have increased cumulative returns by an estimated 33%. The winner's curse isn't a minor nuance -- it's a major systematic bias, especially for underpowered experiments (where the noise-to-signal ratio is highest).
Why can't you just use the same data to both decide and measure? Because conditional on selecting a "winner" (positive observed lift), you've filtered for experiments where positive noise exceeded any negative signal. The observed lift is the true lift plus noise, and you've conditioned on their sum being positive. This is a textbook selection bias.
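One family of corrections is empirical-Bayes shrinkage: pull each noisy observed lift toward the portfolio-wide mean, shrinking noisier estimates harder. The sketch below uses a simple normal-normal model with invented numbers; it is illustrative only and far simpler than Netflix's actual hierarchical model:

```python
import numpy as np

# Minimal empirical-Bayes (normal-normal) shrinkage sketch.
observed = np.array([0.012, 0.008, 0.003, 0.015, 0.006])  # observed lifts (assumed)
se = np.full_like(observed, 0.005)                         # per-test standard errors

# Estimate the spread of true effects across experiments.
prior_mean = observed.mean()
prior_var = max(observed.var(ddof=1) - (se ** 2).mean(), 1e-8)

# Posterior mean pulls each noisy estimate toward the prior mean;
# the noisier the experiment, the harder the shrinkage.
weight = prior_var / (prior_var + se ** 2)
shrunk = prior_mean + weight * (observed - prior_mean)

print("summed raw lifts:    ", observed.sum().round(4))
print("summed shrunk lifts: ", shrunk.sum().round(4))
```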
Check your understanding:
- Why does DoorDash discard the first 7 minutes of each 30-minute switchback block? What tradeoff does this create?
- Netflix found that correcting for winner's curse would improve cumulative returns by 33%. Why is this bias especially severe for underpowered experiments?
Step 3: The Hard Parts [Level 3]
The Holdout's "Ground Truth" Isn't Ground Truth
Here's the expert-level nuance that separates practitioners from textbook users: the holdout measures the cumulative effect relative to a frozen product experience. That is not the same as the effect relative to "no experimentation program."
The counterfactual of "no experimentation" isn't a frozen product. It's a product developed via HiPPO decisions (Highest Paid Person's Opinion), intuition, or competitor copying. The holdout tells you what your experiments did compared to stasis -- but stasis was never the alternative.
As the holdout ages, it develops additional problems:
- Survivorship bias: If you don't add new users to the holdout, the group ages. New users entering the platform go into treatment. The holdout becomes systematically different in composition.
- ML model staleness: If treatment includes continuously-retrained ML models, the holdout's frozen model doesn't just fail to improve -- it actively degrades as data distributions shift.
- Compensating behaviors: Users stuck on a sufficiently outdated experience develop workarounds or disengage entirely. This isn't what would happen in a "gradually evolving product" -- it's an artifact of the extreme divergence.
Some practitioners (notably Schaun Wheeler) argue this makes holdouts a "false idol" -- the comparison looks scientific, but the thing being compared has become increasingly invalid.
The Surrogate Index: Predicting the Long Run from the Short Run
Athey, Chetty, Imbens & Kang proposed the surrogate index: combine multiple short-term outcomes (surrogates) to predict long-term treatment effects without waiting. The logic: if the treatment affects long-term outcomes only through short-term surrogates, you can decompose the estimation into (1) treatment-to-surrogate effects (from the experiment) and (2) surrogate-to-long-term-outcome relationships (from observational data).
The power and the danger are in the surrogacy assumption. If there's a "quiet channel" -- a treatment effect that doesn't show up in any short-term metric but does affect long-term retention -- the surrogate index will be biased, and you won't know it. This assumption is fundamentally untestable.
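A minimal two-stage sketch of the idea on simulated data (the surrogate weights and effect sizes are invented). Note that nothing in this code can detect a violated surrogacy assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)

# Stage 1: historical data containing both surrogates and the long-term outcome.
n_hist = 20_000
surrogates_hist = rng.normal(size=(n_hist, 3))  # e.g. week-1 engagement metrics
long_term_hist = surrogates_hist @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 1, n_hist)
index_model = LinearRegression().fit(surrogates_hist, long_term_hist)

# Stage 2: experiment data with surrogates only (long-term outcome not yet observed).
n_exp = 10_000
treated = rng.integers(0, 2, n_exp)
surrogates_exp = rng.normal(size=(n_exp, 3)) + 0.1 * treated[:, None]
surrogate_index = index_model.predict(surrogates_exp)

effect = surrogate_index[treated == 1].mean() - surrogate_index[treated == 0].mean()
print(f"estimated long-run effect via surrogate index: {effect:.3f}")
# If the treatment affects the long-term outcome through a channel missing from
# the surrogates, this estimate is biased -- and nothing in the data flags it.
```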
Spotify extended the approach by constructing an instrumental variable from regression residuals combined with experimental data, handling some forms of unobserved confounding. But the research frontier remains active.
Switchback Under Extreme Conditions
Two frontier challenges are pushing switchback theory:
- Reinforcement learning treatments (arXiv, 2024): When the treatment is an RL policy that adapts to outcomes, the policy itself changes during the experiment, creating feedback loops that violate standard switchback assumptions.
- Long-term treatments (Netflix, ICML 2024): When the treatment shapes user preferences over time (e.g., a recommendation algorithm that gradually changes what users want to watch), short-run experiments can't even capture the treatment's mechanism, let alone its long-run effect.
Check your understanding:
- Why is the holdout's "frozen product" a different counterfactual from "no experimentation program"? What practical consequence does this distinction have?
- Under what specific condition does the surrogate index approach give biased estimates? Why is this condition untestable?
The Mental Models Worth Keeping
- The Portfolio Audit Model: Your experimentation program is a portfolio. Individual test results are self-reported gains. Holdouts are the independent audit. Never confuse reported gains with audited returns.
- The Interference Boundary: Ask "does treating person A change outcomes for person B?" If yes, user-level randomization is broken. You need unit-level randomization -- over time (switchback), geography (geo-experiment), or both.
- The Bias-Variance Seesaw (for switchback design): Shorter blocks = more data but more carryover contamination. Longer blocks = cleaner data but fewer comparisons. The optimal point depends on how fast treatment effects decay -- a pricing change fades in minutes, a recommendation algorithm shift lingers for days. (A simulation sketch follows this list.)
- Selection-Estimation Coupling: Using the same data to decide "is this a winner?" and to estimate "how big is the win?" creates systematic upward bias. This is the winner's curse, and it's not avoidable without either separate estimation data, Bayesian shrinkage, or external validation.
- The Finite Attention Budget: User engagement is roughly zero-sum within a session. Features compete for the same limited resource: human time and attention. This makes individual feature impacts non-additive by default.
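The Bias-Variance Seesaw can be made concrete with a quick simulation: when the treatment effect takes time to build after a switch, shorter blocks attenuate the naive difference-in-means estimate. The dynamics below (a linear ramp-up, the specific block lengths) are illustrative assumptions, not a calibrated model:

```python
import numpy as np

rng = np.random.default_rng(6)

def switchback_estimate(block_len, horizon=4_320, ramp_minutes=20, true_effect=1.0):
    """Naive difference-in-means from one simulated switchback run."""
    n_blocks = horizon // block_len
    assignment = rng.integers(0, 2, n_blocks)
    outcomes, treated = [], []
    minutes_in_state, prev = 0, assignment[0]
    for a in assignment:
        if a != prev:
            minutes_in_state = 0            # a switch just happened
        for _ in range(block_len):
            # treatment effect ramps up after a switch instead of appearing instantly
            exposure = min(minutes_in_state / ramp_minutes, 1.0) if a == 1 else 0.0
            outcomes.append(true_effect * exposure + rng.normal(0, 1.0))
            treated.append(a)
            minutes_in_state += 1
        prev = a
    y, t = np.array(outcomes), np.array(treated)
    return y[t == 1].mean() - y[t == 0].mean()

for block_len in (10, 30, 120):
    estimates = [switchback_estimate(block_len) for _ in range(30)]
    print(f"{block_len:3d}-minute blocks -> mean estimate "
          f"{np.mean(estimates):.2f}, sd {np.std(estimates):.2f} (truth = 1.00)")
```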
What Most People Get Wrong
1. "A 1% holdout is enough"
- Why people believe it: It seems like a reasonable compromise between measurement and business cost. Platform documentation sometimes suggests it.
- What's actually true: Unless you have hundreds of millions of users, 1% is dramatically underpowered. You need at minimum 5%. The effective sample size is dominated by the smaller group -- the 99% treatment group doesn't help you much.
- How to tell: Run a power calculation for your holdout (see the sketch after this list). If the minimum detectable effect (MDE) for your holdout is larger than the cumulative effect you're trying to detect, it's useless.
2. "Holdouts are just longer A/B tests"
- Why people believe it: Both involve a control group exposed to the status quo.
- What's actually true: The target estimand is fundamentally different. A holdout measures the cumulative impact of many features. An A/B test measures one feature. The analytical approach, required duration, and failure modes are all different.
- How to tell: Ask "is this measuring one change or many?" That's your answer.
3. "More switching = more power in switchback"
- Why people believe it: Intuition from standard experiments (more observations = more power).
- What's actually true: More switching increases carryover contamination, which can dominate the variance reduction. There's an optimal switching frequency.
- How to tell: If your treatment has any persistent effects (and nearly all do), faster switching makes your estimates more biased, not more precise.
4. "If every experiment wins, the program must be winning"
- Why people believe it: It seems like simple addition should work.
- What's actually true: Cannibalization, winner's curse, and novelty decay mean individually positive features can combine to a net-negative or near-zero cumulative effect. Disney found this directly.
- How to tell: This is precisely what holdouts are designed to detect.
5. "Switchback experiments measure individual treatment effects"
- Why people believe it: They look like A/B tests with a time dimension.
- What's actually true: Switchback measures the global treatment effect -- everyone treated vs. everyone control. This is a market-level estimand, distinct from the individual-level effect in standard A/B tests. The distinction matters enormously for extrapolation.
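The power check from misconception #1 can be sketched directly: compute the holdout's minimum detectable effect as a function of its share of traffic. The user counts, metric mean, and standard deviation below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def holdout_mde(n_total, holdout_share, sd, alpha=0.05, power=0.80):
    """Two-sample MDE at the given significance level and power."""
    n_holdout = n_total * holdout_share
    n_treated = n_total * (1 - holdout_share)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sd * np.sqrt(1 / n_holdout + 1 / n_treated)

n_users, sd, baseline = 2_000_000, 30.0, 100.0
for share in (0.01, 0.05):
    mde = holdout_mde(n_users, share, sd)
    print(f"{share:.0%} holdout -> MDE = {mde:.3f} ({mde / baseline:.2%} of baseline)")
# The MDE is dominated by 1/n_holdout: shrinking the holdout from 5% to 1%
# makes the detectable effect roughly sqrt(5) ~ 2.2x larger.
```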
The 5 Whys -- Root Causes Worth Knowing
Chain 1: "Holdouts show cumulative impact is less than summed wins"
Claim: Summed individual wins overstate reality.
Why 1: Winner's curse inflates each estimate.
Why 2: Conditional on declaring a "winner," observed lift includes positive noise.
Why 3: Same data used to decide AND estimate -- selection-estimation coupling.
Why 4: Separate decision/estimation data would require running every experiment twice.
Why 5: Experimentation is already slow; doubling it would halve innovation rate.
Root insight: The coupling between "deciding to ship" and "measuring the effect" is an inherent information-theoretic constraint, not a solvable engineering problem.
Level 2 deep: Bayesian shrinkage can partially correct this, but the prior itself is estimated from noisy data -- shrinkage trades one bias for another.
Level 3 deep: The holdout remains valuable as a model-free check on whatever correction method you use.
Chain 2: "Switchback experiments are necessary for marketplaces"
Claim: Standard A/B tests fail in two-sided markets.
Why 1: User-level randomization produces biased estimates.
Why 2: Treating some users differently affects other users' outcomes (interference).
Why 3: Marketplace participants compete for shared resources.
Why 4: Market-clearing mechanisms operate on the entire pool, not individual pairs.
Why 5: Efficient allocation requires global optimization -- fundamentally incompatible with local treatment assignment.
Root insight: Markets are coordination mechanisms; the whole point is that everyone's actions affect everyone else.
Level 2 deep: The interference structure is unknown and changes from moment to moment.
Level 3 deep: Research suggests SUTVA violation can cause 2-5x overestimation in marketplace settings.
Chain 3: "Google reduced mobile ad load by 50% after long-run analysis"
Claim: Short-run optimal ad load was 2x the long-run optimum.
Why 1: Users learn from ad quality -- good ads teach users to click; bad ads teach users to ignore.
Why 2: Humans form heuristic rules from repeated experience (cognitive economy).
Why 3: Each additional ad provides diminishing short-run revenue but accelerating long-run damage.
Why 4: The user learning effect is invisible in short-run experiments.
Why 5: The long-run revenue-maximizing ad load is dramatically lower than the myopic optimum.
Root insight: When treatment effects are non-stationary and path-dependent, short-run optimization can be catastrophically wrong about the long-run optimum.
The Numbers That Matter
| Number | What It Means |
|---|---|
| 5% minimum holdout | The realistic floor for a holdout that will detect anything useful. 1% sounds efficient but is statistically useless unless you're at Facebook scale. To put it in perspective: with a 1% holdout, your minimum detectable effect might be 5x larger than the cumulative lift you're trying to find. |
| 3-6 months holdout duration | The standard window. Short enough to limit ethical discomfort and engineering burden; long enough to capture novelty decay and slow-building effects. Disney settled on 3 months (one quarter). |
| 80% of summed wins = good | Eppo's benchmark: if your holdout shows 80% of the sum of individual experiment lifts, your experimentation program is healthy. At 20%, something is systemically broken -- your measurements are mostly noise. |
| +33% cumulative returns from winner's curse correction | Netflix found that simply correcting for winner's curse (without changing which features shipped) would increase cumulative returns by a third. The winner's curse isn't a footnote -- it's a primary driver of the gap. |
| 7 of 30 minutes discarded | DoorDash's burn-in ratio (~23%) for switchback experiments. Remarkably close to Brandt's 1938 dairy experiment, which discarded 7 of 35 days (20%) -- carryover washout has been part of switchback design since its agricultural origins. |
| 10x efficiency of ghost ads | Ghost ads reduce experimentation cost by at least an order of magnitude compared to traditional advertising holdouts, while maintaining the same measurement precision. |
| 50% mobile ad load reduction | Google's Hohnhold et al. found that accounting for long-run user learning effects cut the optimal mobile ad load in half. That's the magnitude of error possible when you ignore long-run effects. |
| 6 quarters | Athey et al. found that combining six quarters of outcome data into a surrogate index was sufficient to approximate long-run effects in a labor market study. |
| Error rate: sqrt(log(T)/T) | Under geometric mixing assumptions, the switchback difference-in-means estimator's error decays at this rate -- slower than the sqrt(1/N) rate of standard A/B tests due to temporal correlation. |
Where Smart People Disagree
1. Are holdouts worth the cost?
What it's really about: Whether a model-free audit justifies months of deliberately degraded experience for paying customers.
- Pro (Netflix, Eppo, Disney): There is no alternative way to get an unbiased measure of cumulative impact. The winner's curse alone justifies the investment. Netflix validated their Bayesian correction model against holdback tests -- the holdout is ground truth.
- Con (Wheeler, critics): Holdouts age out, require massive scale, force maintenance of legacy code, and the "ground truth" isn't actually ground truth (it's comparison to a frozen product, not to the real alternative). Bayesian shrinkage may be sufficient.
- Unresolved because: The debate is partly empirical (how big is the gap?) and partly philosophical (what counterfactual matters?).
2. Design-based vs. model-based inference for switchback
What it's really about: Robustness versus power.
- Design-based (Bojinov): Treat outcomes as fixed and unknown; only the assignment mechanism is random. No modeling assumptions means no misspecification risk.
- Model-based (DoorDash): Use multilevel models, covariates, and parametric assumptions for greater statistical power.
- Likely resolution: Domain-dependent. Design-based for high-stakes decisions; model-based for routine experiments where power is the bottleneck.
3. Are experiment interactions a big deal?
What it's really about: Whether the "2+2=3" problem is common or rare.
- Microsoft: Research shows interactions are "rare and tiny." The winner's curse is the main driver of the holdout gap.
- Disney: Found cannibalization was endemic, not exceptional, in content recommendation.
- Unresolved because: The answer likely depends on the product domain. Content recommendation features compete for the same attention; UI tweaks are more independent.
4. Can surrogates replace holdouts?
- Pro: Faster, cheaper, don't require withholding treatment. Theoretically elegant.
- Con: The surrogacy assumption is untestable and likely violated when "quiet channels" exist.
- Emerging consensus: Surrogates as complement, not replacement. Use surrogates for initial estimation, validate with periodic holdouts.
What You Don't Know Yet (And That's OK)
Open problems the field hasn't solved:
- The right estimand for long-run effects in evolving products: The "long-run effect" of a feature depends on what other features exist. As the product changes, so does the effect. Is there a stable, meaningful quantity to estimate? This is philosophically unresolved.
- Switchback with multiple simultaneous treatments: Current theory handles single-treatment switchback well. Multi-treatment designs with different switching schedules introduce complex interactions that aren't yet well-understood.
- Bounding bias from surrogate assumption violations: Without testability, we can't know how wrong surrogate estimates might be. Partial identification / sensitivity analysis methods are underdeveloped for this setting.
- Adaptive treatments that evolve over time: When the treatment is an RL policy or a continuously retrained ML model, the concept of a fixed "treatment" breaks down. Netflix (ICML 2024) identifies this as a frontier problem: how to infer long-term causal effects of long-term treatments from short experiments.
- Competitive dynamics: If your product improves and competitors respond, the measured long-run effect includes competitive reactions that weren't present in the experiment. No current method handles this.
Where your new knowledge runs out: You now understand why long-run estimation is hard and the major approaches to addressing it. You don't yet have the statistical toolkit to implement optimal switchback designs (that requires Bojinov & Simchi-Levi's framework), build surrogate indices (Athey et al.'s econometrics), or construct Bayesian winner's-curse corrections (Netflix's hierarchical models). Each of these is a substantial technical investment.
Subtopics to Explore Next
1. The Winner's Curse in Online Experimentation
Why it's worth it: This is the single largest contributor to the gap between reported and real experimental impact, and understanding it reshapes how you interpret every A/B test result.
Start with: Netflix's "Estimating the Returns from an Experimentation Program" (Ejdemyr, 2024) and Statsig's "The Winner's Curse: Why Winners Underperform."
Estimated depth: Medium (half day)
2. Causal Inference Under Interference (SUTVA Violations)
Why it's worth it: Unlocks understanding of why marketplace experimentation requires fundamentally different designs, and connects to the broader causal inference literature.
Start with: Bojinov's blog post "Beyond A/B Testing: A Practical Introduction to Switchback Experiments," then the formal paper (Management Science, 2023).
Estimated depth: Deep (multi-day)
3. Bayesian Methods for Experimentation (Shrinkage and Hierarchical Models)
Why it's worth it: Bayesian shrinkage is the leading alternative to holdouts for correcting winner's curse -- understanding it lets you evaluate whether your organization needs holdouts at all.
Start with: Amazon Science's "Overcoming the Winner's Curse" and Etsy's "Mitigating the Winner's Curse."
Estimated depth: Medium (half day)
4. Surrogate Index Methods for Long-Term Effect Estimation
Why it's worth it: The fastest path to long-run estimates when holdouts are too expensive, and the econometric framework is elegant and transferable.
Start with: Athey et al.'s NBER Working Paper 26463. Then Spotify's extension handling unobserved confounders.
Estimated depth: Deep (multi-day)
5. Geo-Experiments and Synthetic Control Methods
Why it's worth it: As privacy regulation erodes user-level tracking, geo-experiments become the primary tool for incrementality measurement. Meta's GeoLift is the practical starting point.
Start with: Meta's GeoLift methodology documentation, then the broader synthetic control literature (Abadie et al.).
Estimated depth: Medium (half day)
6. Ghost Ads and Advertising Incrementality Measurement
Why it's worth it: A 10x efficiency gain over traditional holdouts in advertising -- and the underlying principle (counterfactual logging) applies beyond ads.
Start with: Johnson, Lewis, Nubbemeyer (2017) in the Journal of Marketing Research.
Estimated depth: Surface (1-2 hours)
7. Novelty and Primacy Effects in Online Experiments
Why it's worth it: Until you understand how novelty distorts short-run results, you can't correctly interpret any A/B test on a returning-user population.
Start with: Sadeghi & Gupta et al., "Novelty and Primacy: A Long-Term Estimator for Online Experiments" (Technometrics, 2022).
Estimated depth: Surface (1-2 hours)
8. Optimal Experimental Design for Switchback (Formal Theory)
Why it's worth it: Moves you from "use switchback when there's interference" to being able to design the optimal block length, burn-in period, and randomization schedule for your specific context.
Start with: Hu & Wager (2022) on geometric mixing, then Bojinov & Simchi-Levi (2023) on minimax optimal design.
Estimated depth: Deep (multi-day)
Key Takeaways
- The sum of your A/B test wins is almost certainly an overestimate of your actual cumulative impact -- the winner's curse, novelty effects, and feature cannibalization guarantee it.
- The winner's curse is not sloppy statistics; it's an inherent consequence of using the same data to decide and to measure -- and it's the single largest contributor to the gap.
- A holdout is not a longer A/B test -- it answers a fundamentally different question: "Is our experimentation program working?" rather than "Did this feature work?"
- In any system where treating person A changes outcomes for person B, user-level A/B tests produce biased estimates -- and the bias can be 2-5x.
- Switchback experiments trade one problem for another: they solve interference but introduce carryover effects, and the optimal design depends on how fast those effects decay.
- The "ground truth" from a holdout is a comparison to a frozen product, not a comparison to the actual alternative (which would be product development without experimentation). Keep this distinction sharp when interpreting results.
- Shorter switchback blocks are not always better -- they increase bias from carryover contamination even as they reduce variance from having more observations.
- Ghost ads achieve 10x the efficiency of traditional ad holdouts by logging counterfactual auction outcomes rather than actually withholding ads.
- Google found that the long-run optimal ad load was half the short-run optimal -- a vivid demonstration of how dramatically short-run and long-run optima can diverge.
- Surrogate indices can replace waiting for long-run outcomes, but they carry an untestable assumption -- and "quiet channels" (effects invisible in short-term metrics) make them silently wrong.
- Feature cannibalization is product-domain-dependent: Microsoft found interactions "rare and tiny" in general product features; Disney found them endemic in content recommendation. Know your domain.
- Bayesian shrinkage can partially correct winner's curse without holdouts, but the holdout remains valuable as a model-free validation -- it's the check on your check.
- The switchback estimand is fundamentally different from the A/B test estimand: switchback measures "everyone treated vs. everyone control," not the effect on one individual. Don't extrapolate one as if it were the other.
- Holdout groups that aren't refreshed develop survivorship bias, ML model staleness, and compensating user behaviors -- all of which can make the holdout measurement misleading in ways that overstate your program's impact.
Sources Used in This Research
Primary Research:
- Bojinov & Simchi-Levi (2020/2023), "Design and Analysis of Switchback Experiments," Management Science
- Hu & Wager (2022), "Switchback Experiments under Geometric Mixing"
- Athey, Chetty, Imbens & Kang (2019/~2024), "The Surrogate Index," Review of Economic Studies
- Johnson, Lewis, Nubbemeyer (2017), "Ghost Ads," Journal of Marketing Research
- Hohnhold, O'Brien, Tang (KDD 2015), "Focusing on the Long-term"
- Ejdemyr (2024), "Estimating the Returns from an Experimentation Program" (Netflix)
- Goffrier et al. / Spotify (2023), long-term effects with unobserved confounding
- Missault et al. / Amazon (2024-2025), carryover detection in switchback
- Xiong, Chin, Taylor (2023), data-driven switchback designs
- Netflix Research (ICML 2024), long-term causal effects of long-term treatments
- Sadeghi, Gupta et al. (2021/2022), novelty and primacy effects
- Brandt (1938), original switchback trials in dairy cattle
Expert Commentary:
- Eppo: holdout methodology, benchmarks, cumulative impact measurement
- Disney Streaming / Tian Yang (2021): universal holdout pilot findings
- DoorDash (multiple posts): switchback implementation, cluster-robust SEs, burn-in
- Netflix Tech Blog / Ejdemyr & Chou: return-aware experimentation
- Spotify Research (2023): long-term effects from short-run experiments
- Bojinov: practical introduction to switchback
- Wheeler: critique of holdout groups
- Bolt Labs / O'Connell: switchback design considerations
- Amazon Science & Etsy Engineering: winner's curse mitigation
Reference:
- Statsig: switchback overview, winner's curse explainer
- Meta/Facebook: GeoLift methodology documentation
- Pedro Monjo (2025): tiered holdout groups