
A/B Testing Methodology: A Learning Guide

What You're About to Understand

After working through this guide, you'll be able to design an A/B test that won't lie to you — from writing a hypothesis that actually constrains your thinking, through calculating the sample size that makes the test worth running, to reading the results without falling into the traps that invalidate most experiments in the wild. You'll spot when someone's peeking at results, know why "Bayesian testing solves peeking" is dangerously wrong, and understand why the smartest companies treat a flat result as a win. You'll also be able to explain, in plain language, why 90% of experiments fail — and why that's the whole point.

The One Idea That Unlocks Everything

A/B testing is a courtroom trial, not a treasure hunt.

Most people approach A/B testing like prospecting — dig around in the data until you find gold. That's exactly backwards. An A/B test is a trial. You start with a specific accusation (hypothesis). You set the rules of evidence before the trial begins (sample size, significance threshold, metrics). You collect evidence under strict procedure (randomization, fixed duration). And you reach a verdict based on pre-agreed standards — not on how you feel about the defendant.

The moment you change the rules mid-trial — peeking at results, extending the test, adding metrics, shifting your hypothesis — you've corrupted the proceedings. The verdict means nothing. Every major pitfall in A/B testing is a violation of this principle: changing the rules after seeing the evidence.

If you remember only this, you'll have better instincts than most practitioners.

Learning Path

Step 1: The Foundation [Level 1]

Imagine you run an e-commerce store. Your checkout page converts at 3%. Your designer says, "If we simplify the form to three fields, conversions will increase." Your CEO says, "No, add trust badges instead." Who's right?

Neither of them knows. And that's not an insult — it's a statistical fact. Data from Microsoft, Google, and Netflix shows that only about 10% of product changes produce measurable improvement. Expert intuition is wrong far more often than it's right. The Highest Paid Person's Opinion (HiPPO) is a coin flip at best.

So you run an experiment.

The basic pipeline looks like this:

  1. Hypothesis: "If we reduce checkout fields from 7 to 3, then checkout completion rate will increase by at least 2%, because session recordings show 40% of users abandon mid-form."
  2. Power analysis: Calculate how many users you need. A useful rule of thumb is n = 16σ²/δ² per group, where σ² is the metric's variance and δ is the minimum detectable effect (MDE) in absolute terms. For a 3% baseline, detecting a 2-percentage-point absolute lift takes roughly 1,200 users per group; detecting a 2% relative lift (3.00% → 3.06%) takes over a million.
  3. Randomize: Split visitors randomly — half see the old form, half see the new one.
  4. Wait: Run the test for the predetermined duration. Do not touch it.
  5. Analyse: Compare conversion rates using a statistical test. If the p-value is below 0.05, you reject the null hypothesis (that there's no difference).
  6. Decide: Ship, iterate, or kill — informed by data rather than opinion.
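Step 2's rule of thumb is easy to sanity-check in code. A minimal sketch (the function name is illustrative, and δ must be expressed in absolute terms):

```python
import math

def sample_size_per_group(baseline_rate, mde_absolute):
    """Rule-of-thumb sample size per group: n = 16 * variance / delta^2.

    baseline_rate: control conversion rate (e.g. 0.03)
    mde_absolute:  minimum detectable effect in absolute terms (e.g. 0.0006
                   for a 2% relative lift on a 3% baseline)
    """
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    return math.ceil(16 * variance / mde_absolute ** 2)

# A 2-percentage-point absolute lift is cheap to detect...
n_absolute = sample_size_per_group(0.03, 0.02)         # ~1,200 per group
# ...but a 2% *relative* lift (3.00% -> 3.06%) is not.
n_relative = sample_size_per_group(0.03, 0.02 * 0.03)  # > 1,000,000 per group
```

The asymmetry is the point: halving the MDE quadruples the required sample, which is why vague hypotheses ("it will improve things a bit") produce tests that can never finish.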

The hypothesis format is critical: "If [specific change], then [measurable outcome], because [rationale backed by data]." The "because" clause is what separates a testable hypothesis from a guess. Your rationale should come from analytics, heatmaps, session recordings, surveys, or usability testing — not from your gut.

Key Insight: The hypothesis, the metric, and the sample size form what practitioners call the Experimentation Triangle. All three must be locked in before the test starts. Deciding any of them after seeing data corrupts the entire experiment.

Check your understanding:
- Why does the hypothesis need a "because" clause, and what sources should inform it?
- If your baseline conversion rate is 3% and you want to detect a 2% relative lift, why can't you just run the test on 200 users and call it a day?


Step 2: The Mechanism [Level 2]

Now the crucial question: why does this procedure actually work?

Randomization is the entire game. When you randomly assign users to control and treatment, you guarantee that — in expectation — both groups are identical on every characteristic: age, intent, device, mood, time of day, everything you can measure and everything you can't. This is the only research design that establishes causation, not just correlation. Without randomization, you can never be sure the difference you see is caused by your change rather than by some lurking difference between the groups.

The Central Limit Theorem is why your statistics work. Regardless of how weirdly distributed individual user behavior is (and it's very weird), the average behavior in a large sample approaches a normal distribution. This is why z-tests and t-tests are valid — they rely on normality of the sampling distribution, not normality of individual data points.

Why α = 0.05 and power = 0.80? These are conventions, not laws of nature. Alpha (0.05) means you accept a 5% chance of a false positive — declaring a winner when there's no real difference. Fisher reportedly picked it as "worth a second look." Power (0.80) means you accept a 20% chance of missing a real effect. Going from 80% to 90% power increases your required sample size by ~30%. Most businesses find 80% to be the economically rational sweet spot — but for major product launches, 90% or 95% power is justified.
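Those conventions can be checked directly. The general two-sample formula is n = 2(z_crit + z_power)² · σ²/δ² per group; plugging in α = 0.05 and 80% power yields the 16 in the rule of thumb, and 90% power shows the surcharge. A sketch using the standard library's inverse normal CDF:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse standard normal CDF

def power_constant(alpha, power):
    """The multiplier C in n = C * variance / delta^2 for a two-sample test."""
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2

c80 = power_constant(0.05, 0.80)  # ~15.7, rounded up to the "16" rule
c90 = power_constant(0.05, 0.90)  # ~21.0
ratio = c90 / c80                 # ~1.34: 90% power costs roughly a third more sample
```

The exact ratio is closer to 34% than 30%, but the shape of the tradeoff is what matters: each extra point of power costs progressively more users.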

Worked example — what a p-value actually means:

You run your checkout test. The treatment group converts at 3.15% vs. control at 3.00%. Your test returns p = 0.03.

This does NOT mean "there's a 97% chance the new form is better." It means: "If there were truly no difference between the forms, data this extreme would occur only 3% of the time by chance." It's P(data | no effect), not P(effect | data). These are fundamentally different questions, connected only through Bayes' theorem — which requires you to estimate how likely the effect was before you saw the data.

This distinction is the single most misunderstood concept in all of applied statistics. Human brains naturally think in Bayesian terms ("what's the probability this works?"), but the frequentist framework gives you something else entirely. Understanding this gap is what separates literate practitioners from everyone else.
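The arithmetic behind the worked example is worth seeing once. The guide doesn't state the sample size behind p = 0.03, so the sketch below assumes a hypothetical 125,000 users per group, which happens to land near that value for these rates (a pooled two-proportion z-test, stdlib only):

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal tail probability via erfc: P(|Z| >= |z|) when H0 is true
    return math.erfc(abs(z) / math.sqrt(2))

# 3.00% vs ~3.15% with a hypothetical 125,000 users per group
p = two_proportion_p_value(3750, 125_000, 3938, 125_000)  # ~0.03
```

Note what the function computes: the probability of data this extreme assuming no difference. Nothing in it is the probability that the treatment works.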

Key Insight: A p-value answers the question you didn't ask. You want to know "does this work?" The p-value tells you "how surprising is this data if it doesn't work?" These are not the same question, and confusing them is the root of most statistical malpractice.

Check your understanding:
- If randomization ensures the groups are equivalent, why do you still need a statistical test? Why not just compare the raw numbers?
- A colleague says "our p-value is 0.03, so there's a 97% probability the treatment is better." What's wrong with this statement, and what would a correct interpretation be?


Step 3: The Hard Parts [Level 3]

Here's where the simple model shatters.

The Peeking Problem — or, why your dashboard is lying to you.

You launch your test on Monday. By Wednesday, the dashboard shows p = 0.04. Ship it?

No. That 0.04 is meaningless. Here's why: the p-value calculation assumes you looked at the data exactly once, at a pre-determined sample size. When you peek repeatedly, you give the test statistic — which behaves like a random walk — multiple chances to wander past the significance boundary. Check 10 times, and your actual false positive rate inflates from 5% to roughly 26%. You've quintupled your error rate just by watching.

The mathematics are elegant and unforgiving: a random walk will eventually cross any fixed threshold (a property of Brownian motion). Each peek is another lottery ticket for a false positive. Most A/B testing platforms compute p-values continuously and display them as if the sample size were fixed — meaning every time you look before the test ends, the number on your screen means something different from what you think it means.

Solutions exist but cost you something. Sequential testing methods (Pocock bounds, O'Brien-Fleming bounds, alpha spending functions, or the mSPRT used by Optimizely) let you peek at predefined checkpoints. The price: each checkpoint requires a more stringent threshold. With 10 planned peeks, you need to see p < 0.01 (not 0.05) at each peek to maintain a true 5% overall error rate. Flexibility to stop early costs statistical power at each checkpoint.
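A small simulation makes the inflation concrete: run many null experiments, peek 10 times at a growing sample, and "ship" the moment |z| crosses the boundary. The exact inflation depends on how the peeks are spaced; with 10 equally spaced looks, the naive 1.96 boundary fires around four times the nominal rate, while a Pocock-style boundary of ~2.56 (per-look p < ~0.0106) restores roughly 5% overall. A sketch, with arbitrary seed and batch sizes:

```python
import math
import random

random.seed(42)

def false_positive_rate(z_boundary, n_sims=4000, peeks=10, batch=1000):
    """Fraction of null experiments 'shipped' at any of `peeks` looks."""
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(peeks):
            # Sum of `batch` iid N(0, 1) observations is N(0, batch)
            total += random.gauss(0.0, math.sqrt(batch))
            n += batch
            if abs(total / math.sqrt(n)) > z_boundary:  # z-stat at this peek
                hits += 1
                break
    return hits / n_sims

naive = false_positive_rate(1.96)   # ~0.19 with 10 equally spaced looks
pocock = false_positive_rate(2.56)  # ~0.05: the stricter boundary pays for the peeks
```

The Pocock run also shows the price named above: at each individual look, a real effect has to clear 2.56 standard errors instead of 1.96, which is lost power per checkpoint.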

The Multiple Comparisons Trap.

Your test tracks 20 metrics. Conversion is flat, but "time on confirmation page" is significant at p = 0.03! Ship it?

No. With 20 metrics at α = 0.05, you have a 64% chance (1 - 0.95²⁰) of at least one false positive. The significant result is probably noise.

Corrections exist: Bonferroni (divide α by the number of tests — safe but brutally conservative) and Benjamini-Hochberg FDR (controls the proportion of false discoveries rather than the probability of any false discovery — better power, still rigorous). The practical approach from Kohavi et al.: distinguish between decision metrics (need correction), guardrail metrics (need monitoring), and debug/learning metrics (don't need correction). Apply correction only where it drives the ship/no-ship decision.
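The mechanics of both corrections fit in a few lines (a sketch; the p-values at the bottom are made up to show the contrast):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m: controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * q: controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# With 20 uncorrected metrics, a false positive is more likely than not:
fwer_20 = 1 - 0.95 ** 20  # ~0.64

p_vals = [0.001, 0.008, 0.02, 0.03, 0.04]
n_bonf = sum(bonferroni(p_vals))        # 2: only the two smallest clear 0.05/5
n_bh = sum(benjamini_hochberg(p_vals))  # 5: FDR control keeps all five
```

The same batch of p-values yields two rejections under Bonferroni and five under BH, which is exactly the power tradeoff the text describes.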

The Winner's Curse — the silent death spiral.

Run an underpowered test. The only way it reaches significance is if random noise inflates the observed effect. So your "significant" result overestimates the true effect by 2-10x. You ship with inflated expectations. Real-world performance disappoints. Your team loses faith in experimentation. They invest less, run even more underpowered tests, and the cycle accelerates. This is not a theoretical concern — it's the most common failure mode of experimentation programs, and it's invisible because you never go back to re-measure the true effect after shipping.
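The curse is easy to reproduce: give every simulated experiment the same true lift, power the test badly, and look only at the runs that reached significance. A sketch with arbitrary effect and sample sizes:

```python
import math
import random

random.seed(7)

def winners_curse(true_effect=0.1, n_per_group=100, n_sims=5000):
    """Mean observed effect among significant results, relative to the truth.

    Observations have unit variance, so the observed difference in means
    is distributed Normal(true_effect, sqrt(2 / n_per_group)).
    """
    se = math.sqrt(2 / n_per_group)
    significant = []
    for _ in range(n_sims):
        observed = random.gauss(true_effect, se)
        if abs(observed) > 1.96 * se:  # crossed the "p < 0.05" boundary
            significant.append(observed)
    power = len(significant) / n_sims
    exaggeration = (sum(significant) / len(significant)) / true_effect
    return power, exaggeration

power, exaggeration = winners_curse()
# power lands near 0.10; among the "winners", the observed lift
# overstates the true effect roughly threefold
```

Nothing here is exotic: the significance filter simply refuses to pass any estimate smaller than ~1.96 standard errors, so when the true effect is smaller than that, only inflated draws survive.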

SUTVA Violations — when users aren't independent.

The Stable Unit Treatment Value Assumption says one user's outcome is unaffected by another's treatment. This breaks in marketplaces (giving Uber riders discounts affects driver supply for control riders), social networks (treating a user changes their friends' feeds), and any shared-resource system. Solutions like cluster randomization and switchback experiments exist but are incomplete — cluster randomization dramatically reduces effective sample size (N users in 10 geographic clusters gives you 10 data points, not N), and defining good clusters requires knowing the interference structure in advance.

Key Insight: The winner's curse creates a feedback loop with no natural correction. You overestimate effects, ship based on inflated numbers, never re-measure, and believe your experimentation program is generating 10% lifts when it's actually generating 2%. No one ever discovers the discrepancy.

Check your understanding:
- Your team runs a test, peeks at day 3 and sees p = 0.12, then peeks at day 7 and sees p = 0.04. They celebrate and ship. Explain, using the random walk analogy, why this result is unreliable — and what they should have done instead.
- A PM argues that Bonferroni correction is "too conservative" and wants to skip it. Under what conditions might they have a point, and when would skipping it be genuinely dangerous?

The Mental Models Worth Keeping

1. The Courtroom Trial Model
An experiment is a trial with rules set before evidence is examined. Hypothesis, metrics, sample size, and stopping rules are the "rules of evidence." Changing them after data arrives is like a judge rewriting the law after hearing testimony. Use it when: you're tempted to extend a test, add a metric, or re-segment after seeing results.

2. The Error Budget Model
Every experiment has a fixed "budget" of allowable error (α). Each peek, each metric, each comparison spends some of that budget. More comparisons = less budget per comparison = lower power everywhere. Use it when: stakeholders want to track 30 metrics, or you're planning how many interim analyses to allow.

3. The Experimentation Triangle
Hypothesis → Metric → Sample Size. All three must be defined before the experiment starts, and they're interdependent. Change the MDE and you change the required sample. Change the metric and you change the variance. Use it when: planning any new test.

4. Twyman's Law ("Too good to be true = probably isn't true")
"Any figure that looks interesting or different is usually wrong." The more extreme a result, the more likely it's a bug, not a breakthrough. Use it when: you see a 40% lift in a test. Check the instrumentation before celebrating.

5. The Decision Metric Hierarchy
OEC (primary ship/no-ship metric) → Guardrail metrics (things that must NOT degrade) → Debug metrics (diagnostics) → Learning metrics (exploration). Different statistical standards apply to each tier. Use it when: deciding which metrics need multiple comparison correction and which don't.

What Most People Get Wrong

1. "Bayesian A/B testing lets you peek anytime"
Why people believe it: Several popular testing tools market this claim. Bayesian posteriors feel like they should update naturally.
What's actually true: Bayesian tests with fixed posterior thresholds suffer the same false positive inflation as frequentist tests when you stop on success. Only sequential testing methods — whether frequentist or Bayesian — with proper stopping rules solve peeking. David Robinson demonstrated this formally.
How to tell in the wild: Ask whether the tool implements sequential methods with calibrated stopping rules, or just computes a posterior and lets you stop whenever you like.

2. "p < 0.05 means 95% chance the treatment works"
Why people believe it: It's what everyone wants p-values to mean, and the correct interpretation is genuinely counterintuitive.
What's actually true: p = 0.05 means "if there's truly no effect, data this extreme would occur 5% of the time." The probability the treatment works requires Bayesian inference with a prior.
How to tell in the wild: When someone states a p-value, ask them: "the probability of what, given what?" If they can't answer precisely, they're misinterpreting it.

3. "Not significant = no effect"
Why people believe it: It feels like the logical contrapositive. If we can't prove it works, it doesn't work.
What's actually true: A non-significant result may simply mean the test was underpowered — there weren't enough users to detect a real but small effect. Absence of evidence is not evidence of absence.
How to tell in the wild: Check the test's power. If it was powered to detect a 10% lift but the true effect is 2%, the test will almost certainly come back non-significant — and that tells you nothing.

4. "The bigger the observed lift, the better"
Why people believe it: Bigger numbers feel better. A 15% lift is more exciting than a 2% lift.
What's actually true: In underpowered tests, large observed effects are more likely to be inflated by noise (the winner's curse / Type M error). The more surprising a result, the more scepticism it deserves.
How to tell in the wild: Compare the observed effect to the MDE the test was powered for. If the effect is much larger than what you were powered to detect, check the confidence interval — its lower bound is a more realistic estimate of the true effect.

5. "Just run it longer until it's significant"
Why people believe it: It seems like more data should help. And sometimes p-values do cross 0.05 with more data.
What's actually true: Extending a test based on observed results breaks the fixed-sample assumption. The decision to extend was influenced by the data, which changes the statistical properties of the test. You must plan the duration upfront. If you want flexibility, use sequential testing from the start.
How to tell in the wild: Ask "was the decision to extend made before or after looking at the data?" If after, the results are compromised.

The 5 Whys — Root Causes Worth Knowing

Why does peeking inflate false positives?
→ Because p-value calibration assumes a fixed sample size → Because the test statistic follows its known distribution only at the pre-specified N → Because a random walk will eventually cross any fixed threshold → Because boundary crossings accumulate over time (Brownian motion) → Because the significance level was computed for one evaluation, not many.
Root insight: Repeated checking turns a single test into multiple tests. Sequential testing methods fix this by adjusting the boundary at each checkpoint — but the price is lower power per peek.

Why do 90% of A/B tests fail to show improvement?
→ Because human intuition about user behaviour is unreliable → Because product builders suffer from the curse of knowledge → Because user behaviour depends on complex, context-dependent cognition → Because the interaction of a specific population, context, and design is irreducibly complex → Because there's no general theory of user behaviour.
Root insight: This 90% failure rate isn't a bug — it's the reason experimentation exists. If intuition were reliable, you wouldn't need to test. The program's job is to cheaply sort the 10% of winners from the 90% of non-winners.

Why is the winner's curse systematic, not occasional?
→ Because underpowered tests detect effects only when noise inflates them → Because the significance threshold selectively passes overestimates → Because organisations set unrealistically large MDEs to shrink sample requirements → Because there's pressure to test many ideas quickly → Because experimentation programs are evaluated on velocity (tests per quarter), not quality (power, replication rate).
Root insight: The winner's curse is invisible because you never learn the true effect after shipping. There's no feedback loop. The organisation believes it's making 10% improvements when it's actually making 2% improvements, and nobody discovers the discrepancy.

Why does p-value misinterpretation persist despite decades of education?
→ Because the correct interpretation is counterintuitive → Because people want P(H₁|data) but get P(data|H₀) → Because converting between them requires Bayes' theorem and a prior → Because frequentist philosophy refuses to assign probabilities to hypotheses → Because statistics education teaches procedure (calculate, compare) rather than understanding (what does this number mean?).
Root insight: Human cognition is natively Bayesian — we evolved to update beliefs from evidence ("rustle in bushes → probably a predator"). The frequentist framework fights our natural reasoning, and education hasn't bridged the gap because the hybrid NHST framework taught in most programs was never internally coherent.

Why does SRM occur in 6-10% of all tests?
→ Because randomization or data pipelines have systematic errors → Because SRM can originate at assignment, execution, logging, or analysis stages → Because experimentation platforms are complex distributed systems → Because the path from user request to analysis has many failure points → Because reliable randomization at scale is harder than it appears (caching, race conditions, redirects, cookie deletion).
Root insight: SRM is especially dangerous because it can produce either false positives or false negatives, and there's no way to know which direction the bias goes. The 6-10% rate may be a lower bound — additional SRM may exist below detection thresholds.
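A basic SRM check is cheap enough to run on every experiment. For a planned 50/50 split it's a one-degree-of-freedom chi-square test, and for one degree of freedom the tail probability is exactly erfc(√(χ²/2)), so no stats library is required. A sketch (the p < 0.001 alarm level is the threshold commonly recommended for SRM alerts):

```python
import math

def srm_p_value(n_control, n_treatment):
    """Chi-square goodness-of-fit p-value against a planned 50/50 split."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    # Chi-square survival function with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

srm_p_value(5000, 5100)    # ~0.32: ordinary traffic noise, no alarm
srm_p_value(10000, 10500)  # < 0.001: the randomization is likely broken
```

Note the threshold direction: a 2.4% imbalance on 20,000 users is alarming, while the same ratio on 10,000 users is not, because the test keys off how improbable the imbalance is, not its size.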

The Numbers That Matter

5% → 26%: The false positive rate inflation when you peek at results 10 times during a test. That's not a marginal increase — you've made your test five times more likely to give you a wrong answer. To maintain a true 5% rate with 10 peeks, each peek needs to clear p < 0.01.

~10% experiment success rate: Across 20,000 experiments analysed by Optimizely, only about one in ten showed statistically significant improvement. To put that in perspective: if your experimentation program has a 50% "win rate," you're either testing only safe, obvious ideas — or you're not being rigorous enough with your statistics.

6-10% SRM prevalence: Roughly 1 in 10 to 1 in 17 A/B tests have broken randomization. That's like discovering that one in a dozen clinical trials accidentally gave some patients the wrong pill. Any test with SRM should be discarded regardless of its p-value.

64% chance of a false positive with 20 metrics: Track 20 metrics at α = 0.05 without correction, and you're more likely than not to "find" something that isn't there. It's like rolling a die 20 times and being surprised when a 6 comes up.

2-10x winner's curse exaggeration: In underpowered studies, the observed "significant" effect can overstate reality by a factor of two to ten. That "12% lift" your team shipped? Quite possibly a 2% lift in disguise.

50%+ variance reduction from CUPED: Microsoft's method uses pre-experiment user behaviour as a covariate, cutting required sample size roughly in half. That's like doubling your site's traffic for free — but only works for returning users with sufficient historical data. Optimal pre-experiment window: 1-2 weeks.

95% surrogate metric consistency at Netflix: Across 200 experiments, Netflix's short-term proxy metrics matched long-term outcomes 95% of the time. This means decisions about annual retention can potentially be made within days — a transformative acceleration of learning velocity.

80% → 90% power costs 30% more sample: Moving from "detect a real effect 80% of the time" to "90% of the time" requires roughly 30% more users. That's the diminishing returns of statistical power — each increment costs progressively more.

Where Smart People Disagree

Bayesian vs. Frequentist A/B Testing
The disagreement isn't really about which framework is "better" — both give similar answers when properly implemented. The real fight is about communication and stopping rules. Bayesian advocates argue that posteriors ("73% probability B is better") are more useful to business stakeholders than p-values. Frequentists counter that error rate guarantees are mathematically stronger and don't require specifying priors, which can be controversial. The emerging consensus: the framework matters less than whether you use sequential testing methods — which exist in both paradigms.
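The posterior statement Bayesian advocates prefer is straightforward to produce. With uniform Beta(1, 1) priors, each rate's posterior is Beta(1 + conversions, 1 + non-conversions), and P(B > A) falls out of Monte Carlo draws. A sketch with made-up counts (and note: none of this fixes peeking; the stopping rule still matters):

```python
import random

random.seed(11)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000):
    """P(rate_B > rate_A) under independent uniform Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

prob_b_beats_a(300, 10_000, 330, 10_000)  # ~0.89: "89% chance B is better"
```

A stakeholder can act on "89% chance B is better" without a statistics lecture, which is the communication argument in a nutshell; the error-rate guarantees, though, still depend on how and when you decided to stop.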

Should α = 0.05 be the standard?
Benjamin et al. argued in a prominent 2018 paper for α = 0.005, noting that p = 0.05 corresponds to Bayesian evidence ratios of only about 3:1 against the null — hardly compelling. But for individual experimenters at low-traffic sites, 0.005 would make most tests impossibly expensive. The pragmatic view: the threshold should depend on the asymmetric costs of false positives vs. false negatives, which vary by context. A company running thousands of tests needs stricter thresholds than one running ten.

Multi-armed bandits vs. classical A/B tests
Bandits adaptively shift traffic to winning variants during the experiment, minimising "regret" (lost revenue from showing the worse option). But they sacrifice statistical confidence in the final answer. The unresolved tension: bandits are great for transient decisions (email subject lines, ad copy) where you optimise and move on, but dangerous for permanent product decisions where you need to know the truth. The debate centres on what you're optimising for — knowledge or immediate performance.

How aggressively to correct for multiple comparisons
Bonferroni is safe but so conservative it can make real effects undetectable. Benjamini-Hochberg FDR is a better power tradeoff. Some practitioners argue that in exploratory analyses, any "hit" is worth investigating further without correction. Kohavi et al. offer the practical resolution: correct decision metrics, don't correct debug/learning metrics.

What You Don't Know Yet (And That's OK)

After absorbing this guide, here's where your knowledge runs out:

Long-term effect estimation remains largely unsolved. Most tests run 1-4 weeks, but many product decisions play out over months or years. Netflix's surrogate metrics work in 95% of cases, but building them is labour-intensive and domain-specific.

Interference in networked systems — social networks, marketplaces, shared-resource platforms — breaks the fundamental independence assumption. Current solutions (cluster randomization, switchback experiments) are imperfect. How to run valid experiments when users influence each other is one of the hardest open problems in causal inference.

Interaction effects between concurrent experiments become a real problem when you run thousands of tests simultaneously. Layered randomisation helps but isn't perfect. Detecting which experiments interact with each other at scale is computationally daunting.

Heterogeneous treatment effects — discovering who benefits from a change, not just whether there's an average effect — is an active research frontier. The challenge is statistical: subgroup analysis multiplies your comparisons, inflating error rates, and subgroup effects are correlational, not causal, unless you randomised by subgroup.

The ethics of optimisation — where A/B testing ends and manipulation begins — has no consensus. The line between "personalising the experience" and "exploiting dark patterns" is blurry, and regulatory frameworks (GDPR, EU Digital Services Act) are still catching up.

The local maximum problem — A/B testing optimises within a given design space but cannot discover that a fundamentally different approach would be superior. How to combine incremental testing with exploratory innovation remains philosophically unresolved.

Subtopics to Explore Next

1. Sequential Testing Methods (Pocock, O'Brien-Fleming, mSPRT)
Why it's worth it: This is the actual solution to the peeking problem — the single most common source of invalid results in practice.
Start with: Evan Miller's "How Not To Run an A/B Test," then Netflix Tech Blog's series on sequential testing.
Estimated depth: Medium (half day)

2. Power Analysis and Sample Size Calculation
Why it's worth it: Unlocks the ability to determine, before you start, whether a test is even worth running — saving weeks of wasted effort.
Start with: Evan Miller's sample size calculator (evanmiller.org) and work backwards from real baseline rates and MDEs.
Estimated depth: Medium (half day)

3. CUPED and Variance Reduction Techniques
Why it's worth it: Effectively doubles your experimental throughput by cutting required sample sizes in half — free statistical power.
Start with: Matteo Courthoud's "Understanding CUPED" blog post, then Microsoft Research's deep dive.
Estimated depth: Medium (half day)

4. The Multiple Comparisons Problem in Depth
Why it's worth it: Understanding Bonferroni vs. BH-FDR vs. no correction — and when each is appropriate — is the difference between rigorous and sloppy analysis.
Start with: Statsig's "Multiple Comparison Corrections" post, then the Kohavi decision metric hierarchy.
Estimated depth: Surface (1-2 hours)

5. Bayesian A/B Testing — Properly Implemented
Why it's worth it: Unlocks posterior probabilities ("73% chance B is better") that are genuinely more useful for business decisions — if implemented with proper stopping rules.
Start with: David Robinson's "Is Bayesian A/B Testing Immune to Peeking?" on Variance Explained.
Estimated depth: Deep (multi-day)

6. Causal Inference Beyond Randomisation
Why it's worth it: For situations where you can't randomise — ethical constraints, marketplace dynamics, policy evaluation — you need diff-in-diff, regression discontinuity, and instrumental variables.
Start with: The convergence between the A/B testing world and the causal inference world, starting from SUTVA violations.
Estimated depth: Deep (multi-day)

7. Experimentation Platform Architecture
Why it's worth it: Understanding how feature flags, randomisation, and logging systems work — and fail — explains why SRM occurs in 6-10% of tests and how to prevent it.
Start with: DoorDash's blog on addressing SRM challenges, then Microsoft Research's SRM diagnostics article.
Estimated depth: Medium (half day)

8. Metrics Design — OEC, Guardrails, and Goodhart's Law
Why it's worth it: The metric you optimise for determines what your experimentation program actually achieves. Bad metrics produce confident wrong answers.
Start with: ABsmartly's post on OEC, then the Airbnb guardrail example (hiding house rules at checkout).
Estimated depth: Surface (1-2 hours)

Key Takeaways

- Treat every experiment as a trial: lock the hypothesis, metric, sample size, and stopping rule before any data arrives.
- A p-value is P(data | no effect), not the probability that the treatment works.
- Peeking, uncorrected metrics, and mid-test extensions silently inflate your false positive rate; sequential methods and corrections fix this, but they cost power.
- Underpowered tests that reach significance overstate the true effect (the winner's curse). Compare observed lifts to the MDE you powered for.
- A ~10% win rate and plenty of flat results are signs of a healthy experimentation program, not a failing one.

Sources Used in This Research

Primary Research:
- Johari et al. (2015/2021), "Always Valid Inference: Continuous Monitoring of A/B Tests," Operations Research
- Larsen et al. (2023), "Statistical Challenges in Online Controlled Experiments: A Review," The American Statistician
- Kohavi et al. (2020), "Online Randomized Controlled Experiments at Scale," Trials/Springer
- Perezgonzalez (2015), "Fisher, Neyman-Pearson or NHST? A Tutorial," PMC
- Kohavi, Tang & Xu (2020), Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge University Press
- Georgiev (2019), Statistical Methods in Online A/B Testing
- Papers on interference (arXiv 2024), variance reduction (arXiv 2024), multiple testing (PMC), p-value misinterpretation (PMC), experimentation ethics (Springer 2023), novelty/primacy effects (arXiv 2021)

Expert Commentary:
- Evan Miller, "How Not To Run an A/B Test"
- David Robinson, "Is Bayesian A/B Testing Immune to Peeking? Not Exactly," Variance Explained
- Netflix Tech Blog on sequential testing
- Microsoft Research on SRM diagnostics and CUPED
- Etsy Engineering on winner's curse mitigation
- Airbnb Engineering on selection bias
- DoorDash Engineering on SRM challenges
- Nubank Engineering on CUPED implementation
- Harvard Data Science Review on scaling experimentation (2022)
- GrowthBook, Statsig, and ABsmartly documentation

Good Journalism:
- CXL: Bayesian vs. Frequentist A/B Testing
- PostHog: A/B Testing Mistakes; Guardrail Metrics
- Unbounce: Hypothesis formulation; Surprising A/B test results
- VWO: A/B Testing Hypothesis creation
- Towards Data Science: Why Most A/B Tests Are Lying to You
- Braze: Multi-Armed Bandit vs. A/B Testing
- Adobe: Common A/B Testing Pitfalls

Reference:
- Evan Miller's A/B Testing Tools & Sample Size Calculator
- GrowthBook Docs: Multiple Testing Corrections
- Lukas Vermeer's SRM Checker
- Wikipedia: A/B Testing; Multiple Comparisons Problem