Biased and Debiased Machine Learning in Causal Inference

The application of machine learning methods to the estimation of treatment effects is a burgeoning area of causal inference. The regression function g and the propensity score e used to estimate the average treatment effect can be estimated with machine learning methods such as random forests. However, naive application of machine learning methods to estimate treatment effects yields biased estimates because of regularization bias. The proof below examines the asymptotic behavior of the naive estimator.

Proof Process

  1. Define the Model and Estimator:

    • Assume a partially linear regression model where Y is the outcome, Z is the treatment indicator, and X represents covariates.
    • The regression function g(X) and the propensity score e(X) are modeled non-parametrically using machine learning methods.
  2. Assumptions:

    • Treatment assignment Z is unconfounded given covariates X.
    • The residuals (errors) satisfy the usual assumptions (e.g., mean zero, finite variance).
  3. Estimator Definition:

    • Split the data into two groups.
    • Estimate g(X) using the second group with a machine learning method.
    • Regress Y - \hat{g}(X) on Z in the first group to obtain the estimator \hat{\theta} of the treatment effect.
  4. Rewrite the Estimator:

    • Express the estimator \hat{\theta} in a form that separates the effect of g(X) from the residuals:

      \hat{\theta} = \frac{1}{N_1} \sum_{i \in \text{treated}} \left[ Y_i - \hat{g}(X_i) \right] - \frac{1}{N_0} \sum_{i \in \text{control}} \left[ Y_i - \hat{g}(X_i) \right]
  5. Decompose into Asymptotic Terms:

    • Decompose \hat{\theta} into terms that reveal the contributions of the residuals and of the estimation error in \hat{g}(X):

      \hat{\theta} = \theta + \frac{1}{N_1} \sum_{i \in \text{treated}} \left[ \epsilon_i - (\hat{g}(X_i) - g(X_i)) \right] - \frac{1}{N_0} \sum_{i \in \text{control}} \left[ \epsilon_i - (\hat{g}(X_i) - g(X_i)) \right]
  6. Analyze Asymptotic Distribution:

    • First Term: Under the usual regularity conditions, the term involving the residuals \epsilon_i, once scaled by \sqrt{N}, converges in distribution to a normal distribution.
    • Second Term: The term involving \hat{g}(X_i) - g(X_i) reflects the regularization bias of the machine learning method. It does not have mean zero, and because regularized estimators converge to g slower than N^{-1/2}, it need not vanish after \sqrt{N} scaling and can diverge.
  7. Conclusion on Bias:

    • The second term introduces bias because the machine learning method's regularization bias does not vanish at the \sqrt{N} rate.
    • As a result, the estimator \hat{\theta} is asymptotically biased, even though the first term converges to a normal distribution. A simulation sketch illustrating this bias follows.
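The argument above can be checked numerically. Below is a minimal simulation sketch, not from the original text: the data-generating process, sample sizes, and random-forest settings are illustrative assumptions. It fits \hat{g} on control units from one half of the sample (so the fit targets g) and differences mean residuals on the other half, as in step 3. In runs of this sketch the naive estimate tends to drift away from the true \theta, because the forest's smoothing error is correlated with the propensity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
theta, n, n_sims = 1.0, 2000, 100
estimates = []
for _ in range(n_sims):
    X = rng.uniform(-2, 2, size=(n, 5))
    g = np.sin(X[:, 0]) + X[:, 1] ** 2           # true baseline g(X)
    e = 1.0 / (1.0 + np.exp(-X[:, 0]))           # true propensity e(X)
    Z = rng.binomial(1, e)
    Y = theta * Z + g + rng.normal(size=n)

    # Naive plug-in: fit g_hat on the control units of one half (so it
    # targets g), then difference the mean residuals on the other half.
    fit_idx, eval_idx = np.arange(n // 2, n), np.arange(0, n // 2)
    ctrl = fit_idx[Z[fit_idx] == 0]
    g_hat = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                                  random_state=0).fit(X[ctrl], Y[ctrl])
    U = Y[eval_idx] - g_hat.predict(X[eval_idx])  # residuals Y - g_hat(X)
    estimates.append(U[Z[eval_idx] == 1].mean() - U[Z[eval_idx] == 0].mean())

print(f"mean naive estimate: {np.mean(estimates):.3f} (true theta = {theta})")
```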

Detailed Example

  1. Model Setup:

    • Let Y = Z\theta + g(X) + \epsilon.
    • Estimate g(X) using a machine learning method (e.g., a random forest).
  2. Estimator:

    • Split data into two parts.

    • Use part one to estimate \hat{g}(X).

    • Use part two to calculate \hat{\theta}:

      \hat{\theta} = \frac{1}{N_1} \sum_{i=1}^{N} Z_i (Y_i - \hat{g}(X_i)) - \frac{1}{N_0} \sum_{i=1}^{N} (1 - Z_i) (Y_i - \hat{g}(X_i))
  3. Asymptotic Analysis:

    • Rewrite the estimator to separate terms:

      \hat{\theta} = \theta + \left( \frac{1}{N_1} \sum_{i \in \text{treated}} \epsilon_i - \frac{1}{N_0} \sum_{i \in \text{control}} \epsilon_i \right) - \left( \frac{1}{N_1} \sum_{i \in \text{treated}} (\hat{g}(X_i) - g(X_i)) - \frac{1}{N_0} \sum_{i \in \text{control}} (\hat{g}(X_i) - g(X_i)) \right)
  4. Identify Bias Term:

    • The second term, \frac{1}{N_1} \sum_{i \in \text{treated}} (\hat{g}(X_i) - g(X_i)) - \frac{1}{N_0} \sum_{i \in \text{control}} (\hat{g}(X_i) - g(X_i)), represents the bias due to regularization in the machine learning method.
  5. Conclusion:

    • Because \hat{g} converges to g slower than N^{-1/2}, the bias term does not converge to zero after \sqrt{N} scaling, leading to an asymptotically biased estimator.
    • This illustrates why naive application of machine learning methods, without addressing regularization bias, can lead to incorrect estimates of treatment effects; the sketch below evaluates the two bracketed terms directly.
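To make the decomposition concrete, the following sketch (again an illustrative assumption, reusing the simulation design from the earlier block) evaluates the two bracketed terms separately; the true g and \epsilon are available here only because the data are simulated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(-2, 2, size=(n, 5))
g = np.sin(X[:, 0]) + X[:, 1] ** 2               # true g(X), known by design
e = 1.0 / (1.0 + np.exp(-X[:, 0]))
Z = rng.binomial(1, e)
eps = rng.normal(size=n)
Y = 1.0 * Z + g + eps

fit_idx, eval_idx = np.arange(n // 2, n), np.arange(0, n // 2)
ctrl = fit_idx[Z[fit_idx] == 0]
g_hat = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                              random_state=0).fit(X[ctrl], Y[ctrl])

Ze, eps_e = Z[eval_idx], eps[eval_idx]
dg = g_hat.predict(X[eval_idx]) - g[eval_idx]    # estimation error g_hat - g
t, c = Ze == 1, Ze == 0
resid_term = eps_e[t].mean() - eps_e[c].mean()   # first bracketed term
bias_term = dg[t].mean() - dg[c].mean()          # second bracketed term
root_n = np.sqrt(len(eval_idx))
print(f"sqrt(N) x residual term: {root_n * resid_term:.3f}")  # ~O(1) by the CLT
print(f"sqrt(N) x bias term:     {root_n * bias_term:.3f}")   # need not be small
```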

Steps to Obtain a Debiased Estimator

To obtain a debiased estimator using machine learning methods, we combine two ideas that together remove the regularization bias: an orthogonalized (doubly robust) score, which is insensitive to small errors in the nuisance estimates, and sample splitting (cross-fitting), which keeps the nuisance estimation errors independent of the observations used for inference. Here is a detailed process:

1. Set Up the Problem

Define the outcome Y, treatment indicator Z, and covariates X. The objective is to estimate the average treatment effect (ATE).

2. Split the Data

Divide the dataset into two parts so that the nuisance functions are estimated on observations different from those used for inference:

  • Part 1 (auxiliary sample): Used to estimate the nuisance functions, i.e., the propensity score and the regression function.
  • Part 2 (main sample): Used to compute the treatment-effect estimate.

Swapping the roles of the two parts and averaging the resulting estimates (cross-fitting) recovers full efficiency.
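A minimal splitting sketch, assuming NumPy; the helper name split_indices is hypothetical:

```python
import numpy as np

def split_indices(n, seed=0):
    """Randomly split {0, ..., n-1} into an auxiliary half and a main half."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2:]

# Example: idx1 (Part 1) for nuisance estimation, idx2 (Part 2) for inference.
# idx1, idx2 = split_indices(len(Y))
```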

3. Estimate the Propensity Score

Using Part 1 of the data, estimate the propensity score e(X) = P(Z = 1 | X). This can be done using a machine learning model such as logistic regression, random forests, or other methods.
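A sketch of this step with scikit-learn; the model choice, hyperparameters, and variable names are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

def fit_propensity(X1, Z1):
    """Fit e_hat(X) = P(Z = 1 | X) on the auxiliary sample (Part 1)."""
    model = RandomForestClassifier(n_estimators=500, min_samples_leaf=10,
                                   random_state=0)
    return model.fit(X1, Z1)

# Propensities on the main sample (Part 2):
# e_hat2 = fit_propensity(X1, Z1).predict_proba(X2)[:, 1]
```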

4. Estimate the Regression Function

Also using Part 1 of the data, estimate the regression function g(X) = E[Y | X] using a machine learning model such as random forests, gradient boosting machines, or any other suitable method.
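The analogous sketch for the regression function, under the same illustrative naming:

```python
from sklearn.ensemble import RandomForestRegressor

def fit_regression(X1, Y1):
    """Fit g_hat(X) = E[Y | X] on the auxiliary sample (Part 1)."""
    model = RandomForestRegressor(n_estimators=500, min_samples_leaf=10,
                                  random_state=0)
    return model.fit(X1, Y1)

# Predictions on the main sample (Part 2):
# g_hat2 = fit_regression(X1, Y1).predict(X2)
```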

5. Calculate Residuals

For the observations in Part 2 of the data, both treated and control, calculate the residuals:

\hat{U}_i = Y_i - \hat{g}(X_i)

6. Debiasing Step

Estimate the treatment effect by combining the residuals with the propensity scores. On Part 2, weight each residual by the inverse of the estimated probability of the treatment actually received:

\hat{\theta} = \frac{1}{N} \sum_{i=1}^N \left( \frac{Z_i (Y_i - \hat{g}(X_i))}{\hat{e}(X_i)} - \frac{(1 - Z_i) (Y_i - \hat{g}(X_i))}{1 - \hat{e}(X_i)} \right)

Here the sum runs over the N observations of Part 2. This score is orthogonal to the nuisance functions: errors in \hat{g} and \hat{e} enter the estimator only through products of estimation errors, so the regularization bias of either nuisance estimate no longer appears at first order. Averaging the estimates obtained after swapping the two parts completes the cross-fitting.
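A sketch of the debiasing step under the same illustrative naming; clipping the estimated propensities away from 0 and 1 is a common numerical safeguard, not part of the formula above:

```python
import numpy as np

def debiased_ate(Y2, Z2, g_hat2, e_hat2, clip=0.01):
    """Orthogonal-score ATE estimate on the main sample (Part 2).

    g_hat2 and e_hat2 are nuisance predictions from models fit on Part 1.
    Returns the estimate and the per-observation scores (used for SEs).
    """
    e = np.clip(e_hat2, clip, 1.0 - clip)
    U = Y2 - g_hat2                              # residuals U_hat
    psi = Z2 * U / e - (1 - Z2) * U / (1 - e)    # per-observation score psi_i
    return psi.mean(), psi
```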

7. Variance Estimation

Estimate the variance of the debiased estimator to construct confidence intervals. Because \hat{\theta} is an average of (approximately) independent score terms \psi_i (the summands in step 6), its variance can be estimated by the empirical variance of those terms: \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (\psi_i - \hat{\theta})^2, giving the standard error \hat{\sigma} / \sqrt{N}.
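A corresponding sketch, using the per-observation scores returned by the function above:

```python
import numpy as np

def standard_error(psi):
    """Standard error of the debiased estimate from the scores psi_i."""
    return np.sqrt(np.var(psi, ddof=1) / len(psi))
```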

8. Construct Confidence Intervals

Using the standard error, construct a normal-approximation confidence interval for the treatment effect, e.g., the 95% interval \hat{\theta} \pm 1.96 \, \hat{\sigma} / \sqrt{N}. An end-to-end sketch of steps 2-8 follows.
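Putting steps 2-8 together, here is a self-contained end-to-end sketch on simulated data; the data-generating process, model choices, and hyperparameters are illustrative assumptions rather than prescriptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n, theta = 4000, 1.0
X = rng.uniform(-2, 2, size=(n, 5))
g = np.sin(X[:, 0]) + X[:, 1] ** 2               # true baseline g(X)
e = 1.0 / (1.0 + np.exp(-X[:, 0]))               # true propensity e(X)
Z = rng.binomial(1, e)
Y = theta * Z + g + rng.normal(size=n)

perm = rng.permutation(n)
folds = [perm[: n // 2], perm[n // 2:]]

scores = []
for k in range(2):                               # cross-fitting: swap fold roles
    aux, main = folds[k], folds[1 - k]
    e_model = RandomForestClassifier(n_estimators=300, min_samples_leaf=10,
                                     random_state=0).fit(X[aux], Z[aux])
    g_model = RandomForestRegressor(n_estimators=300, min_samples_leaf=10,
                                    random_state=0).fit(X[aux], Y[aux])
    e_hat = np.clip(e_model.predict_proba(X[main])[:, 1], 0.01, 0.99)
    U = Y[main] - g_model.predict(X[main])       # residuals on the main fold
    scores.append(Z[main] * U / e_hat - (1 - Z[main]) * U / (1 - e_hat))

psi = np.concatenate(scores)
theta_hat = psi.mean()
se = np.sqrt(np.var(psi, ddof=1) / len(psi))
print(f"theta_hat = {theta_hat:.3f}, "
      f"95% CI = [{theta_hat - 1.96 * se:.3f}, {theta_hat + 1.96 * se:.3f}]")
```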