

Leniency Designs: Why You’ve Been Doing Them Wrong (And How to Fix It)

causal inference
instrumental variables
methodology
A breakdown of Goldsmith-Pinkham, Hull, and Kolesár (2025), showing why standard 2SLS fails in judge IV designs with many instruments and how UJIVE solves the problem. Interactive simulations reveal the bias.
Author: Jacob Jameson

Published: November 20, 2025

Introduction

If you’ve published a paper using a judge IV design, you need to read Goldsmith-Pinkham, Hull, and Kolesár (2025). It might change your conclusions.

Leniency designs have become one of the most popular identification strategies in applied micro. The setup is clean: randomly assigned decision-makers (judges, examiners, loan officers) vary in their leniency, creating quasi-experimental variation in treatment. Over the past decade, these designs have powered influential papers on bail decisions (Dobbie et al. 2018), patent values (Farre-Mensa et al. 2020), disability insurance (Maestas et al. 2013), and dozens of other topics.

The standard approach is straightforward: use two-stage least squares (2SLS), instrumenting treatment with examiner fixed effects. Random assignment ensures exogeneity, variation in leniency ensures relevance, and you’re done.

Or so we thought.

The Paper’s Main Contribution

Goldsmith-Pinkham, Hull, and Kolesár show that standard 2SLS systematically fails in leniency designs when you have many decision-makers. Not “might be slightly biased” or “could be inefficient”—it fundamentally breaks down. The problem isn’t weak instruments in the traditional sense. Even when your first-stage F-statistic looks strong, 2SLS delivers:

  1. Biased point estimates (pulled toward OLS)

  2. Artificially small standard errors (creating false precision)

  3. Invalid inference (your t-stats are wrong)

The culprit is a subtle mechanical correlation: when estimating examiner \(j\)’s leniency, 2SLS includes observation \(i\)’s own treatment status in the calculation. This creates correlation between your instrument and the error term—exactly what IV is supposed to avoid.

The Solution

The paper proposes the Unbiased Jackknife Instrumental Variables Estimator (UJIVE), which uses leave-one-out estimation to break the mechanical correlation. When constructing the instrument for observation \(i\), UJIVE estimates all examiner leniencies using data excluding observation \(i\). Simple idea, big consequences.

What This Post Covers

I’ll walk you through the paper’s key results using interactive simulations that let you see the bias in real time. We’ll cover:

  1. Quick refresher on leniency designs (you probably know this, but let’s set notation)
  2. The many-weak instrument problem (why 2SLS fails, with simulations)
  3. Why standard errors are wrong too (not just point estimates)
  4. The UJIVE solution (how leave-one-out fixes everything)
  5. Empirical re-analysis (Farre-Mensa et al. 2020 on patents)
  6. Practical guidelines (5-step checklist for your next paper)

The bottom line: if you’re using leniency designs, you should probably switch from 2SLS to UJIVE. Let me show you why.

Quick Refresher: The Leniency Design

You probably know this cold, but let’s establish notation. Consider the outcome equation:

\[y_i = \gamma + \beta x_i + \varepsilon_i\]

where:

  • \(y_i\) is the outcome (e.g., future innovation for startup \(i\))

  • \(x_i \in \{0,1\}\) is treatment (e.g., patent approval)

  • \(\beta\) is the causal effect we want

  • \(\varepsilon_i\) captures unobservables

The identification problem: \(x_i\) and \(\varepsilon_i\) are correlated. Patents go to better startups, bail is granted to safer defendants, etc. OLS is biased.

The leniency design solution: Cases are randomly assigned to decision-makers \(j = 1, \ldots, K\) who vary in leniency. Let \(z_i\) be a vector of examiner indicators (one for each examiner, minus a reference category). The first-stage regression is:

\[x_i = z_i'\pi + w_i'\delta + \nu_i\]

where \(w_i\) are necessary controls (e.g., art unit × year FE in patent setting) and \(\pi\) captures examiner leniencies relative to the omitted examiner.

Why it works: Random assignment means \(z_i \perp \varepsilon_i | w_i\). Variation in leniency means \(\pi \neq 0\). Standard 2SLS instruments with \(z_i\) (controlling for \(w_i\)) to estimate \(\beta\).
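Before specializing to two examiners, here is a minimal R sketch of this general first stage on simulated data. Everything in it is illustrative: the variable art_unit_year stands in for the necessary controls \(w_i\), and the examiner approval propensities are drawn at random.

# Minimal sketch of the leniency-design first stage (simulated, illustrative data).
set.seed(1)
n <- 5000; K <- 100
examiner_id   <- sample(K, n, replace = TRUE)          # random assignment to examiners
art_unit_year <- factor(sample(20, n, replace = TRUE)) # stand-in for necessary controls
leniency      <- runif(K, 0.3, 0.7)                    # true approval propensity per examiner
x             <- rbinom(n, 1, leniency[examiner_id])   # treatment: approval decision
first_stage   <- lm(x ~ factor(examiner_id) + art_unit_year)
summary(first_stage)$fstatistic                        # joint F for examiner dummies + controls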

The Simple Two-Examiner Case

To build intuition, consider just two examiners: one tough (\(t\)), one soft (\(s\)). Let \(p_t\) and \(p_s\) be their approval rates, with \(p_s > p_t\). With a single binary instrument \(z_i \in \{0,1\}\) indicating assignment to the soft examiner, the IV estimator simplifies to the Wald estimator:

\[\hat{\beta}_{IV} = \frac{E[y_i | z_i = 1] - E[y_i | z_i = 0]}{E[x_i | z_i = 1] - E[x_i | z_i = 0]} = \frac{\bar{y}_s - \bar{y}_t}{p_s - p_t}\]
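Before looking at the figure, here is a minimal R sketch of the two-examiner case. The approval rates (0.40 and 0.70), the true effect of 0.5, and the selection-on-unobservables story are all made up for illustration.

# Two-examiner leniency design on simulated data (illustrative parameters only).
set.seed(42)
n    <- 50000
beta <- 0.5                                        # true causal effect
z    <- rbinom(n, 1, 0.5)                          # 1 = assigned to the soft examiner
u    <- rnorm(n)                                   # unobserved case quality
p    <- ifelse(z == 1, 0.70, 0.40)                 # soft vs. tough baseline approval rates
x    <- as.integer(runif(n) < plogis(qlogis(p) + u))   # approval also responds to quality
y    <- beta * x + u + rnorm(n)

wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(x[z == 1]) - mean(x[z == 0]))
c(wald = wald, ols = unname(coef(lm(y ~ x))["x"]))
# The Wald estimate sits near 0.5 (up to sampling noise); OLS is biased upward.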

Let’s see this in action:

Figure 1: The Two-Examiner IV Estimator

With just two examiners, IV works beautifully. The estimate recovers the true effect. Standard errors are straightforward (het-robust, no clustering needed since assignment is iid).

The key insight: This works because examiner assignment is uncorrelated with \(\varepsilon_i\) by random assignment. The Wald estimator identifies a local average treatment effect (LATE) for compliers—cases that would be approved by the soft examiner but denied by the tough one.

The Problem: Many Instruments

In practice, you never have two examiners. Patent offices have hundreds of examiners, courts have dozens of judges, disability offices have many screeners. With \(K\) examiners, you have \(K-1\) instruments (the examiner dummies).

This is where things break down.

The Mechanical Correlation

The standard 2SLS approach:

  1. First stage: Regress \(x_i\) on all examiner dummies \(z_i\) (and controls \(w_i\))
  2. Get predicted values: \(\hat{x}_i = z_i'\hat{\pi} + w_i'\hat{\delta}\)
  3. Second stage: Regress \(y_i\) on \(\hat{x}_i\) (and controls)

Here’s the problem: \(\hat{x}_i\) is the predicted approval probability for application \(i\) based on examiner \(j\)’s approval rate. But that approval rate is calculated using all observations assigned to examiner \(j\)—including observation \(i\) itself!

If examiner \(j\) handled 50 cases and observation \(i\) is one of them, then \(\hat{x}_i\) is mechanically correlated with \(x_i\). And since \(x_i\) is correlated with \(\varepsilon_i\) (that’s why we need IV!), this means \(\hat{x}_i\) is correlated with \(\varepsilon_i\).

Your instrument is contaminated.
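You can see the contamination directly in a few lines of R. The sketch below simulates purely random treatment and compares each observation’s examiner approval rate computed with and without that observation; all numbers are illustrative.

# Own-observation contamination: examiner approval rates with vs. without obs i.
set.seed(7)
n <- 2000; K <- 100                              # 20 cases per examiner
examiner <- rep(seq_len(K), length.out = n)
x        <- rbinom(n, 1, 0.5)                    # treatment, fully random here
grp_sum  <- ave(x, examiner, FUN = sum)
grp_n    <- ave(x, examiner, FUN = length)
lev_full <- grp_sum / grp_n                      # includes obs i (what 2SLS effectively uses)
lev_loo  <- (grp_sum - x) / (grp_n - 1)          # excludes obs i (the UJIVE ingredient)
c(cor_full = cor(lev_full, x), cor_loo = cor(lev_loo, x))
# lev_full is mechanically correlated with x even though treatment is pure noise;
# lev_loo is not.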

The Bias Formula

Under homoskedasticity, the paper shows that 2SLS bias can be approximated as:

\[\text{Bias}(2SLS) \approx \text{Bias}(OLS) \times \frac{1}{E[F]}\]

where \(E[F]\) is the expected value of the first-stage F-statistic.

The famous “F > 10” rule of thumb says you want \(E[F] > 10\) to keep 2SLS bias below 10% of OLS bias. But with many examiners, \(E[F]\) can be small even when the examiners collectively explain meaningful variation, because the F-statistic spreads that explanatory power over all \(K-1\) instruments.

Key insight from the paper: The bias isn’t just about weak instruments in the traditional sense. Even with “strong enough” first stages, the own-observation contamination creates bias.
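To make the approximation concrete, here is a tiny R illustration with arbitrarily chosen values of \(E[F]\):

# Approximate share of the OLS bias that 2SLS retains, for a few values of E[F].
EF <- c(1.5, 2, 5, 10, 50)
data.frame(expected_F = EF, bias_share_of_OLS = round(1 / EF, 2))
# E[F] = 10 retains roughly 10% of the OLS bias, which is where the
# "F > 10" rule of thumb comes from.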

Simulation: Watching the Bias Grow

Let’s see this in action. I’ll simulate data with many examiners and show how 2SLS gets pulled toward OLS as \(K\) increases:

Figure 2: The Many-Examiner Bias Problem

What you’re seeing: As \(K\) increases relative to \(n\), 2SLS (blue) gets pulled toward OLS (red). UJIVE (purple) stays centered on the true effect. This isn’t a weak instruments problem in the traditional sense—the examiners collectively explain meaningful variation. It’s the mechanical correlation from including own-observation data.
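If you want to reproduce this pattern outside the interactive figure, here is a minimal Monte Carlo sketch in R. Everything in it (the data-generating process, the parameter values, the true effect of 0.5) is made up for illustration.

# 2SLS with examiner dummies vs. OLS as the number of examiners K grows.
# Simple DGP: constant effect of 0.5, selection on an unobservable u.
set.seed(123)
sim_once <- function(n, K) {
  examiner <- rep(seq_len(K), length.out = n)      # balanced caseloads
  leniency <- rnorm(K, 0, 0.3)                     # latent examiner leniency
  u <- rnorm(n)
  x <- as.integer(leniency[examiner] + u + rnorm(n) > 0)
  y <- 0.5 * x + u + rnorm(n)
  xhat <- fitted(lm(x ~ factor(examiner)))         # first-stage fitted values
  c(ols  = unname(coef(lm(y ~ x))[2]),
    tsls = unname(coef(lm(y ~ xhat))[2]))
}
res <- sapply(c(10, 100, 400), function(K) rowMeans(replicate(30, sim_once(n = 2000, K))))
colnames(res) <- paste0("K = ", c(10, 100, 400))
round(res, 2)
# As K grows relative to n, the 2SLS row drifts away from 0.5 toward the OLS row.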

The Standard Error Problem

But wait, there’s more! The bias in point estimates is only half the story. The paper shows that 2SLS standard errors are also wrong—and in a way that masks the bias problem.

Look at the distributions above. Notice how 2SLS (blue) is super concentrated? That narrow spike means small variance, which means small standard errors. But those small SEs aren’t reflecting true precision—they’re an artifact of the same overfitting that causes the bias.

Here’s what happens:

  1. 2SLS overstates the predictive power of the instruments (it includes own-observation data)
  2. This inflates the denominator of the IV formula, because the fitted leniencies \(\hat{\ell}_i\) are overfit: they partly reflect observation \(i\)’s own treatment rather than the examiner’s true leniency
  3. The SE formula has that same inflated denominator, so SEs are too small
  4. Result: you get tight confidence intervals around a biased estimate

This is dangerous. You see small p-values and think you’ve precisely estimated a large effect. In reality, you’ve imprecisely estimated a smaller effect.

UJIVE fixes both problems. That wider purple distribution reflects honest uncertainty—larger SEs that correctly account for:

  • Estimation error in leniency measures
  • Heterogeneous treatment effects across complier groups
  • Many-instrument uncertainty

The tradeoff: UJIVE sacrifices some efficiency (larger SEs) to eliminate bias and get correct inference. 2SLS looks precise but is both biased and has wrong SEs. You’d rather have honest uncertainty around an unbiased estimate than false precision around a biased one.

The Solution: UJIVE

The fix is elegant. Instead of including observation \(i\) when estimating its examiner’s leniency, leave it out.

The Leave-One-Out Principle

For each observation \(i\) assigned to examiner \(j\):

  1. Estimate examiner \(j\)’s leniency using all observations except \(i\): \[\hat{\ell}_{-i}\]
  2. Use this leave-one-out leniency as the instrument for \(i\)
  3. Repeat for all observations

This breaks the mechanical correlation. Since \(\hat{\ell}_{-i}\) doesn’t depend on \(x_i\), it can’t be mechanically correlated with \(\varepsilon_i\).

Let me visualize exactly what this means:

Figure 3: The Leave-One-Out Principle

The UJIVE Estimator

Formally, UJIVE is:

\[\hat{\beta}_{UJIVE} = \frac{\sum_i \hat{\ell}_{-i} y_i}{\sum_i \hat{\ell}_{-i} x_i}\]

where \(\hat{\ell}_{-i}\) is the leave-one-out predicted leniency for observation \(i\).

Why this works: Since \(\hat{\ell}_{-i}\) doesn’t use \(i\)’s own data, it’s independent of \((x_i, \varepsilon_i)\) conditional on \(w_i\). No mechanical correlation, no bias.
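Here is a minimal R sketch of the estimator for the simplest case, in which the only “control” is a constant. In that case the leave-one-out leniency \(\hat{\ell}_{-i}\) reduces to observation \(i\)’s examiner approval rate computed without observation \(i\), centered by the leave-one-out grand mean. This is my own toy implementation of the formula above, not the authors’ code.

# UJIVE in the no-controls case: the instrument is the leave-one-out examiner
# mean of x, net of the leave-one-out grand mean (so it is centered).
ujive_loo <- function(y, x, examiner) {
  g_sum <- ave(x, examiner, FUN = sum)
  g_n   <- ave(x, examiner, FUN = length)
  loo_examiner <- (g_sum - x) / (g_n - 1)          # examiner mean excluding obs i
  loo_grand    <- (sum(x) - x) / (length(x) - 1)   # grand mean excluding obs i
  ell <- loo_examiner - loo_grand                  # leave-one-out leniency ell_{-i}
  sum(ell * y) / sum(ell * x)
}
# Usage: ujive_loo(y, x, examiner), where examiner is the examiner ID for each case.

With genuine controls \(w_i\) (art unit × year fixed effects, say), the leave-one-out projections must also partial out the controls, so in practice you would use the authors’ ManyIV package rather than a hand-rolled version like this.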

Key properties:

  • Approximately unbiased even with many weak instruments
  • Correct standard errors accounting for heterogeneous effects
  • Doesn’t require a strong first stage (the \(F > 10\) rule doesn’t apply)
  • Computationally simple (one-step estimator)

Does It Actually Work?

Let’s run a big simulation comparing all three estimators:

Figure 4: Full Comparison - 1,000 Simulations
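A compact, non-interactive version of this comparison might look like the sketch below (my own illustrative DGP with no extra controls; parameter values are arbitrary). It reports the mean and standard deviation of each estimator across replications.

# Monte Carlo comparison of OLS, dummy-based 2SLS, and leave-one-out UJIVE.
# Illustrative DGP: true effect 0.5, selection on u, K examiners, n cases.
set.seed(2025)
one_draw <- function(n = 2000, K = 200, beta = 0.5) {
  examiner <- rep(seq_len(K), length.out = n)
  u <- rnorm(n)
  x <- as.integer(rnorm(K, 0, 0.5)[examiner] + u + rnorm(n) > 0)
  y <- beta * x + u + rnorm(n)

  # 2SLS: fitted values from the examiner-dummy first stage
  xhat <- fitted(lm(x ~ factor(examiner)))

  # UJIVE: leave-one-out examiner mean of x, net of the leave-one-out grand mean
  g_sum <- ave(x, examiner, FUN = sum)
  g_n   <- ave(x, examiner, FUN = length)
  ell   <- (g_sum - x) / (g_n - 1) - (sum(x) - x) / (length(x) - 1)

  c(OLS   = unname(coef(lm(y ~ x))[2]),
    TSLS  = unname(coef(lm(y ~ xhat))[2]),
    UJIVE = sum(ell * y) / sum(ell * x))
}
draws <- replicate(100, one_draw())
round(cbind(mean = rowMeans(draws), sd = apply(draws, 1, sd)), 3)
# Expected pattern: OLS far from 0.5; 2SLS between OLS and 0.5 with a small SD;
# UJIVE centered near 0.5 with a larger, honest SD.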

The key takeaway: Look at the spread of the estimates across simulations. UJIVE has a larger SD (and thus larger standard errors) than 2SLS. This is a feature, not a bug. UJIVE’s wider distribution reflects honest uncertainty—it correctly accounts for estimation error in leniency measures and heterogeneous effects. 2SLS’s tight distribution is false precision.

You’d rather have truthful uncertainty around an unbiased estimate than false confidence around a biased one.

Empirical Application: Patent Values Revisited

Now let’s see this in practice. The paper re-analyzes Farre-Mensa, Hegde, and Ljungqvist (2020), who use patent examiner assignment to estimate how patent approval affects startup innovation.

Setting:

  • 32,514 first-time patent applications by US startups (2001-2013)
  • ~1,200 patent examiners
  • Random assignment within art unit × year
  • Outcomes: future patent applications, approvals, citations

Original approach: Constructed leniency measure (similar to JIVE)

This paper’s re-analysis: Compare UJIVE, 2SLS with examiner dummies, and OLS

Results

Outcome                       UJIVE           2SLS (examiners)   OLS
Any subsequent application    0.173 (0.055)   0.232 (0.016)      0.234 (0.006)
Log(1 + applications)         0.323 (0.100)   0.374 (0.027)      0.357 (0.009)
Any subsequent approval       0.259 (0.050)   0.240 (0.014)      0.223 (0.005)
Log(1 + approvals)            0.356 (0.081)   0.323 (0.021)      0.291 (0.007)
Any citations                 0.183 (0.049)   0.173 (0.014)      0.164 (0.005)
Log(1 + citations)            0.419 (0.125)   0.372 (0.033)      0.339 (0.011)

Standard errors in parentheses

What Changed?

Three things to notice:

  1. Point estimates: UJIVE estimates are somewhat smaller than 2SLS for some outcomes (e.g., “Any subsequent application”: 0.173 vs 0.232). The 2SLS estimates were being pulled toward OLS (0.234).

  2. Standard errors: UJIVE SEs are 3-4× larger! For “Any subsequent application”: UJIVE SE is 0.055 vs 2SLS SE of 0.016. This isn’t UJIVE being inefficient—2SLS SEs were artificially small.

  3. Substantive conclusions: Effects are still significant and economically meaningful:

    • Patent approval increases probability of future applications by 17pp
    • Increases probability of future approvals by 26pp
    • Increases the probability of receiving any future citations by 18pp

But the effects are more modest than 2SLS suggested, and we’re appropriately less certain about them.

A Practical Checklist

The paper provides a 5-step guide for implementing leniency designs. Here’s my condensed version:

1. Identify Necessary Controls

Use institutional knowledge to determine what makes assignment as-good-as-random.

Patent example: Assignment is random within art unit × year, so these are necessary controls.

2. Test Balance

Run UJIVE with predetermined covariates as outcomes. Significant coefficients = red flag.

Why UJIVE for balance tests? Same estimator, consistent approach. Other approaches (regressing covariates on constructed leniency) can show spurious imbalance due to finite-sample bias.
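Schematically, using the same (schematic) estimation call shown in step 3 below and a hypothetical predetermined covariate founder_experience, a balance check replaces the outcome with the covariate:

# Schematic balance test: a predetermined covariate stands in for the outcome.
# A significant "effect" of treatment on founder_experience flags imbalance.
balance <- ujive(
  formula = founder_experience ~ treatment | examiner_dummies | controls,
  data = your_data
)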

3. Estimate with UJIVE

Use UJIVE as primary estimator. Report 2SLS and OLS for comparison.

Software: The authors provide an R package at github.com/kolesarm/ManyIV.

library(ManyIV)

# Schematic call only: outcome, endogenous treatment, examiner instruments, and
# the necessary controls. Check the package documentation for the exact function
# name and formula syntax before running this on real data.
result <- ujive(
  formula = outcome ~ treatment | examiner_dummies | controls,
  data = your_data
)

4. Test Monotonicity

For heterogeneous effects interpretation (LATE), you need monotonicity: no defiers.

Test: Run UJIVE with the constructed outcome indicator(outcome = value) × treatment for each value the outcome takes. The resulting estimates should lie in [0,1].

What it tests: “Average monotonicity”—average leniency of examiners who’d approve exceeds those who’d deny.
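Schematically, for a binary outcome and again using the step-3 interface, the constructed outcome and the check look like this:

# Schematic average-monotonicity check for a binary outcome: the estimate with
# this constructed outcome should lie between 0 and 1.
your_data$y1_x <- as.integer(your_data$outcome == 1) * your_data$treatment
mono <- ujive(
  formula = y1_x ~ treatment | examiner_dummies | controls,
  data = your_data
)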

5. Characterize Compliers

Run UJIVE with covariate × treatment as outcome to estimate complier characteristics.

Why: Check external validity. If compliers look very different from full sample, LATE estimates may not generalize.
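Schematically, with the hypothetical covariate founder_experience and the step-3 interface:

# Schematic complier characterization: the estimate recovers the average of the
# covariate among compliers; compare it to the full-sample mean of founder_experience.
your_data$cov_x <- your_data$founder_experience * your_data$treatment
comp <- ujive(
  formula = cov_x ~ treatment | examiner_dummies | controls,
  data = your_data
)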

References

Dobbie, W., Goldin, J., & Yang, C. S. (2018). The effects of pre-trial detention on conviction, future crime, and employment: Evidence from randomly assigned judges. American Economic Review, 108(2), 201-240.

Farre-Mensa, J., Hegde, D., & Ljungqvist, A. (2020). What is a patent worth? Evidence from the U.S. patent “lottery”. The Journal of Finance, 75(2), 639-682.

Goldsmith-Pinkham, P., Hull, P., & Kolesár, M. (2025). Leniency Designs: An Operator’s Manual. arXiv:2511.03572

Imbens, G. W., & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467-475.

Kolesár, M. (2013). Estimation in an instrumental variables model with treatment effect heterogeneity. Working paper, Princeton University.

Maestas, N., Mullen, K. J., & Strand, A. (2013). Does disability insurance receipt discourage work? Using examiner assignment to estimate causal effects of SSDI receipt. American Economic Review, 103(5), 1797-1829.
