Causal Inference Basics

Without explicitly stating otherwise, we consider binary treatment.

1. Potential Outcomes

In the Rubin-Neyman potential outcome (PO) framework, the potential outcome $Y(t)$ denotes what subject $i$’s outcome would be if he were to take treatment $t$. For example, if the treatment is dichotomous such as getting a heart surgery, then a subject could either receive the treatment ($t_i = 1$) or does not receive the treatment ($t_i = 1$). $Y(1)$ denotes the outcome with the treatment and $Y(0)$ the outcome without the treatment. We use $Y$ to denote the observed outcome (a.k.a, the factual outcome). Note that there are no counterfactuals or factual until the outcome is observed. Before that, there are only potential outcomes.

The individual treatment effect (ITE) is defined in Eq.(1), which is also known as individual causal effect or unit-level causal effect.

\begin{equation} \tau_i \triangleq Y_i(1) - Y_i(0) \end{equation}

If we use $X$ to denote the covariates and $T$ the random variable that corresponds to the treatment observed, then we usually have the observational data sample $\mathcal{D}= \lbrace{({Y_i, X_i, T_i})}\rbrace_{i=1}^n$ with $({Y_i, X_i, T_i}) \stackrel{i.i.d}{\sim} \mathbb{P}$.

1.1 The Fundamental Problem of Causal Inference

We can only observe one outcome for a given individual, i.e. only $Y_i({1})$ or $Y_i({0})$ can be observed. The unobserved outcome is known as the “counterfactual outcome”. Due to this fundamental problem, the ITE defined in Eq.(1) is simply unobservable. If we take the average over the ITEs for all the individuals, we have average treatment effect (ATE) defined in Eq.(2).

$$ \begin{equation} \tau=\mathbb{E}\lbrack{Y({1})} - Y({0})\rbrack \end{equation} $$

We use $X$ to denote the covariates for an individual. We can then define the conditional average treatment effect (CATE) as Eq.(3).

$$ \begin{equation} \tau({x})=\mathbb{E}\lbrack{Y({1})} - Y({0})\mid{X=x}\rbrack \end{equation} $$

Another interesting estimand is Average Treatment Effect on the Treated (ATT): The ATT, on the other hand, is the average effect of the treatment specifically on those who actually received the treatment. It measures the average difference in outcomes for the treated individuals compared to what would have happened if the same individuals had not received the treatment. Mathematically, it is expressed as

$$ ATT = \mathbb{E}\lbrack{Y({1})} - Y({0})\mid{T=1}\rbrack $$

The ATE, which is the average effect across the entire population, will be the same as the ATT because the treatment effect does not differ between those who were treated and those who were not. Since everyone responds similarly to the treatment, the distinction between the treated and untreated (or the selection into treatment) does not matter in terms of how they respond to the treatment. Thus, the average effect among those who received the treatment (ATT) is representative of the average effect you would expect in the entire population (ATE). If the treatment effect varies based on characteristics such as age, gender, health status, etc., then the ATE and ATT might differ.

Investigators should consider the following central question when conceptualizing the target of inference for a specific study—would it be feasible to treat all eligible patients included in the study with the treatment of interest. If the answer to the central question is no, the treatment would not be given to everyone in the eligible population, and only patients with certain characteristics who actually received the treatment would be ideal candidates for treatment; then the target of inference might be defined as average treatment effect among the treated population (ATT).

1.2 Identification Assumptions

A causal quantity (e.g. $\mathbb{E}\lbrack{Y({t})}\rbrack$) is identifiable if we can compute it from a purely statistical quantity such as $\mathbb{E}\lbrack{Y\vert{t}}\rbrack$.

1.2.1 Exchangeability

The intuition behind exchangeability is that we want to ensure the treatment and control groups are comparable. They are the same in all relevant aspects other than the treatment so that we know any difference in the outcome is attributed to the treatment. This intuition is what underlies the concept of “controlling for” or “adjusting for” variables. Mathematically, exchangeability is expressed as Eq.(4).

\begin{equation} ({Y({1}), Y({0}}))\perp{T} \end{equation}

The exchangeability assumption states that the underlying probability for an outcome when receiving the treatment $T$ is identical among the two groups and the risk is equal to the marginal risk in the whole population. In other words, the control group would show the same risk if they had received the treatment as the treatment group. The counterfactual outcome $Y({t})$ like one’s genetic make-up can be thought of as a fixed characteristic of a person existing before the treatment is randomly assigned. $Y({t})$ encodes what would have been one’s outcome if assigned treatment $t$ and thus does not depend on the treatment you later receive. Moreover, independence between the counterfactual outcome and the observed treatment does not imply independence between the observed outcome and observed treatment.

Exchangeability holds in a randomized experiment but not in an observational dataset. However, if we control for relevant variables by conditioning, maybe the subgroups will be exchangeable. This is known as conditional exchangeability or unconfoundedness as expressed in Eq.(5).

\begin{equation} ({Y({1}), Y({0}}))\perp{T}\mid{X} \end{equation}

Conditional exchangeability is the main assumption necessary for causal inference. We can identify causal effect within the levels of $X$.

\begin{equation} \begin{split} \mathbb{E}\lbrack{Y({1})} - Y({0})\mid{X}\rbrack&=\mathbb{E}\lbrack{Y({1})}\mid{X}\rbrack-\mathbb{E}\lbrack{Y({0})}\mid{X}\rbrack\\ &=\mathbb{E}\lbrack{Y({1})}\mid{T=1,X}\rbrack-\mathbb{E}\lbrack{Y({0})}\mid{T=0,X}\rbrack \\ &=\mathbb{E}\lbrack{Y}\mid{T=1,X}\rbrack-\mathbb{E}\lbrack{Y}\mid{T=0,X}\rbrack \end{split} \end{equation}

1.2.2 Positivity/Overlap

For all values for covariates $x$ present in the population,

\begin{equation}0 <P({T=1\mid{X=x}}) < 1\end{equation}

If we have positivity violation, then in Eq.(6), we would condition on zero-probability event. Intuitively, positivity violation means that for some subgroup of the population, everyone always receive the treatment or always receives the control. Then, it wouldn’t be possible to estimate the causal effect for this subgroup since we only see either treatment or control.

We want the covariate distribution of the treatment group to overlap with the covariate distribution of the control group. This means that for any given set of covariate values, it's possible to find individuals who received the treatment and others who did not, allowing for a meaningful comparison between treated and untreated individuals. Mathematically, we want $P({X\mid{T=1}})$ and $P({X\mid{T=0}})$ (Note these two are conditional distribution instead of a real-valued probability) to have the same support, which is why common support is another alias for positivity.

In practice, assessing the overlap assumption involves examining the distribution of propensity scores or covariates across treatment and control groups. Graphical methods, such as plotting the density or cumulative distribution of propensity scores for each group, can help visualize the extent of overlap. Lack of overlap indicates regions where causal inference may be unreliable. Remember positivity is only required for the variables that are required for exchangeability.

1.2.3 Consistency

If the treatment is $T$, then the observed outcome $Y$ is the potential outcome under treatment $T$. $T=t \implies Y=Y({t})$.

Consistency assumption has two components:

A precise definition of the treatment.
The linkage between counterfactual outcomes to the observed outcomes.

The first component deals with the issue of “multiple versions of the treatment”. Consider the following examples.

If we want to study the heart transplant $T$ on 5-year mortality $Y$. The experiment protocol may want to specify the the details of other procedures such as anesthesia, surgical technique and post-operative care. Without these details, it is possible that each doctor had conducted a different version of “heart transplant” with her preferred surgical technique. If different surgical techniques have different causal effect on mortality, then the causal effect is not well-defined.
In observation studies about interventions that do not correspond well to treatment in the real-world, the problem may be even greater. For example, if the intervention is “exercise” and the outcome is “obesity”, we may want to define the duration, frequency, and type of exercise. Additionally, we also want to specify how the time devoted to exercise would otherwise be spent. If the time goes to playing basketball with the children, then the control group may achieve the same weight loss.
If we want to investigate the effect of “obesity” on death, there are many ways to get to the state of obesity and each of them may have different causal effect on death. For example, the obesity due to genetic deficiency may pose greater risk than lack of exercise.

See the “What if” book Section 3.4 for the discussion on the second component.

1.2.4 No Interference

No interference means that an individual’s outcome is unaffected by anyone else’s treatment. The outcome is only the function of the individual’s own treatment. This assumption could be violated in a social network setting. For example, if the treatment is a feature that enables easy chatting and the outcome is the online time for the chat app, the friends of the individual who receives the treatment could also increase their in-app time.

1.3 Adjustment Formula

Adjustment formula estimates the ATE as follows.

\begin{equation} \begin{split} \mathbb{E}\lbrack{Y({1})}-Y({0})\rbrack &= \\ &= \mathbb{E}\lbrack{Y({1}})\rbrack - \mathbb{E}\lbrack{Y({0}})\rbrack \enspace \small{\text{(linearity of expectation)}}\\ &= \mathbb{E}_\mathbb{X}\lbrace\mathbb{E}\lbrack{Y({1}})\mid{X}\rbrack-\mathbb{E}\lbrack{Y({0}})\mid{X}\rbrack\rbrace \enspace \small{\text{(law of iterated expectation)}} \\ &= \mathbb{E}_\mathbb{X}\lbrace\mathbb{E}\lbrack{Y({1}})\mid{T=1, X}\rbrack-\mathbb{E}\lbrack{Y({0}})\mid{T=0, X}\rbrack\rbrace \enspace \small{\text{(unconfoundedness and positivity)}} \\ &= \mathbb{E}_\mathbb{X}\lbrace\mathbb{E}\lbrack{Y}\mid{T=1, X}\rbrack-\mathbb{E}\lbrack{Y}\mid{T=0, X}\rbrack\rbrace \enspace \small{\text{(consistency)}} \\ \end{split} \end{equation}

Usually, the conditional expectations are replaced by an ML model.

2. Randomized Experiment

We consider two designs.

Marginally randomized experiments. A single unconditional (marginal) randomization probability that is common to all individuals. For example, we flip a coin to decide the treatment assignment for each individual in the population.
Conditionally randomized experiment. We use different randomization probabilities for different levels of discrete variable $L$. It can be considered as the combination of two marginally randomized experiment if $L$ is dichotomous. Conditionally randomized experiment guarantees conditional exchangeability. In this design, we say there is effect modification by $L$ or that treatment effect heterogeneity exists across levels of $L$.

Marginally randomized experiments produce covariate balance, which is the covariate distribution is the same across treatment groups. Covariate balance implies that $X$ and $T$ are independent. Further more, under covariate balance, association is causation, $\text{Pr}({y\mid{\text{do}(t)}})=\text{Pr}({y\mid{t}})$. This is also some of the deep learning based causal inference models attempt to achieve covariate balance in the transformed space using representation learning and then estimate the causal effect.

There are two ways for estimating ATE for conditionally randomized experiment design, namely standardization and inverse probability weighting (IPW). Standardization is the name in epidemiology and it is also known as S-learner.

2.1 Standardization

Under conditional exchangeability, positivity, and consistency, the standardized mean for treatment level $T=t$ as

\begin{equation} \begin{split} \mathbb{E}\lbrack{Y({t})}\rbrack \\ &= \sum_{l}\mathbb{E}\lbrack{Y({t})\mid{L=l}}\rbrack \text{Pr}\lbrack{L=l}\rbrack \\ &= \sum_{l}\mathbb{E}\lbrack{Y({t})\mid{T=t, L=l}}\rbrack \text{Pr}\lbrack{L=l}\rbrack \\ &= \sum_{l}\mathbb{E}\lbrack{Y\mid{T=t, L=l}}\rbrack \text{Pr}\lbrack{L=l}\rbrack \\ \end{split} \end{equation}

2.2 Inverse Probability Weighting (IP)

An individual’s IP weight depends on the values of $T$ and $L$. A treated individual receives weight ${1}/{\text{Pr}\lbrack{T=1\mid{L=l}}\rbrack}$ while the untreated receives ${1}/{\text{Pr}\lbrack{T=0\mid{L=l}}\rbrack}$. We can simplify the notation by using the conditional probability density function of $T$ given $L$, $f({T\mid{L}})$. The IP weight is $W^T=1 / f({T\mid{L}})$. These weights create a pseudo-population that is twice as large as the original. In this pseudo-population, $L$ is independent of $T$.

2.3 Relationship between IPW and Standardization

IPW uses the conditional probability of treatment $T$ given covariate $L$ while standardization uses the marginal probability of $L$ and the conditional probability of the outcome given treatment and $L$. They are actually equivalent. See Technical Point 2.3 of the What-if book.

2.4 An Example - Simpson’s Paradox

Consider the following experiment data for treating kidney stones.

Variable	Treatment A	Treatment B	Total
Stone size = Small	81 / 87 = 0.931	234 / 270 = 0.867	315 / 357
Stone size = Large	192 / 263 = 0.730	55 / 80 = 0.688	247 / 343
Total	273 / 350 = 0.78	289 / 350 = 0.826	562 / 700

The overall success rate for treatment A is 0.78 vs the success rate for B is 0.826, which suggests that treatment B is more effective for kidney stone. However, within the two stone sizes, A is superior to B. This contradiction at sub-group level vs the population level is known as Simpson’s paradox. Stone size is the confounding variable. We can use IPW and standardization to get the true counterfactual risks.

Standardization.
- $\text{Pr}({Y^{A}=1})=0.931\times357/700+0.730\times343/700=0.833$
- $\text{Pr}({Y^{B}=1})=0.867\times357/700+0.688\times343/700=0.779$
IPW
- We use the conditional probability of receiving treatment given the stone size to estimate the propensity score.
- $\text{Pr}({Y^{A}=1})=\cfrac{81\times\cfrac{357}{87}+192\times\cfrac{343}{263}}{357+343}=0.833$
- $\text{Pr}({Y^{B}=1})=\cfrac{234\times\cfrac{357}{270}+55\times\cfrac{343}{80}}{357+343}=0.779$

We can see the results are identical and Treatment A is actually better than Treatment B. This is because treatment B gets assigned more “easy cases” whose stone sizes are small. We can see that the IPW essentially expands the population for two stone sizes to equal size. Specifically, for the small stone, the pseudo population has 357 people for A and 357 for B. On the other hand, the large stone group has 343 for both treatments. In this way, stone size is no longer a confounding factor for treatment effectiveness.

3. Causal Graph

3.1 Backdoor Adjustment

The most common causal graph is the following one.

https://lucid.app/lucidchart/e46f8b0b-a02e-48fd-8ac4-913d3292d8f4/edit?viewport_loc=-102%2C-20%2C2541%2C1272%2C0_0&invitationId=inv_df515617-52a8-40a4-bd7c-4b3463bbd1b3

In this causal graph, association flows along the non-directional path T→X→Y, called the backdoor path, and the directional paths X→Y and T→Y are causal. If we use the backdoor adjustment by conditioning on X, then we would identify the causal effect of T.

The backdoor criterion states that a set of variables $W$ satisfies the backdoor criteria relative to $T$ and $Y$ if

$W$ blocks all backdoor paths from $T$ to $Y$.
$W$ does not contain any descendants of $T$

Intuitively, the criterion stem from examining the causal relationship in the following basic building blocks, namely the chain, fork, and collider.

https://lucid.app/lucidchart/e46f8b0b-a02e-48fd-8ac4-913d3292d8f4/edit?viewport_loc=-102%2C-20%2C2541%2C1272%2C0_0&invitationId=inv_df515617-52a8-40a4-bd7c-4b3463bbd1b3

Based on the Bayesian network factorization, we can easily prove that

For chain and fork, if we condition on $X_2$, $X_1$ and $X_3$ is conditionally independent.
For collider, if we don’t condition on $X_2$, $X_1$ and $X_3$ is conditionally independent. Conditioning on a collider may introduce spurious positive or negative association that does not exist for $X_1$ and $X_3$ .
- Brady Neal’s book discusses the seemingly plausible observation that the “Good-looking men are jerks”: most of the nice men one meets are not very good-looking while most of the good-looking men are jerks. It seems that kindness and looks are negatively associated. There is actually a third important variable: availability. The previous observation is actually conditioned on a collider, namely availability. The looks and kindness are NOT associated in general population. But when we condition on their shared child $X_2$ (availability = Yes) here, they become associated. The association now flows along the path $X_3→X_2←X_1$ despite the fact that it does not when we don’t condition on $X_2$.
- Imagine a study aiming to explore the relationship between two variables: the amount of time spent on social media and personal happiness levels. Conventional wisdom and some research suggest these variables might be independent or even negatively correlated in the general population, with excessive social media use potentially linked to lower happiness levels. However, let's introduce a third variable and see how it affects the observed relationship. Public visibility of social media activity, which can be influenced by both the amount of time spent on social media and personal happiness. For example, individuals who spend a lot of time on social media might post more often, and those who are happier might share more positive content, making their activity more visible and engaging. Suppose we conduct a study focusing only on individuals with high public visibility of their social media activity. This selection criteria (the collider) can introduce a spurious positive association between social media usage and happiness. Why? Because the subgroup of users with high visibility likely includes people who are either spending a lot of time on social media (regardless of their happiness level) or are particularly happy (thus sharing more positive content), or both. By conditioning on the collider (public visibility of social media activity), the study might erroneously conclude that higher social media usage is associated with greater personal happiness among the subgroup, even if these variables are independent or negatively correlated in the general population.

3.2 Front-Door Adjustment (TODO)

4. Estimation

We have the observational data sample $\mathcal{D}= \lbrace{({Y_i, X_i, T_i})}\rbrace_{i=1}^n$ with $({Y_i, X_i, T_i}) \stackrel{i.i.d}{\sim} \mathbb{P}$. Under the identification assumptions, namely unconfoundedness, positivity, and consistency, we define the propensity score as the following equation.

\begin{equation} \pi(x)=P(T=1\mid{X=x}) \end{equation}

We also define the conditional outcome as the following equation.

$$\mu_t(x)=\mathbb{E}_{\mathbb{P}}\lbrack{Y\mid{T=t, X=x}}\rbrack$$

We assume no parametric form for these parameters and therefore this is a non-parametric estimation problem. The various estimation methods are mainly characterized by two aspects:

How the nuisance parameters $\eta=(\mu_0(x), \mu_1(x), \pi(x))$ are estimated: the parametric form, the model used.
How to combine/use these estimates.

4.1 Plug-in Estimators

The quantities of interests are ATE and CATE and ATT. CATE is also known as individualized average treatment effects (IATEs). Recall we define ATE as follows.

\begin{equation} \mathbb{E}\lbrack{Y({1})}-Y({0})\rbrack = \mathbb{E}_\mathbb{X}\lbrace\mathbb{E}\lbrack{Y}\mid{T=1, X}\rbrack-\mathbb{E}\lbrack{Y}\mid{T=0, X}\rbrack\rbrace \end{equation}

We can fit a statistical or machine learning model to estimate the conditional expectation $\mathbb{E}\lbrack{Y}\mid{T, X}\rbrack\rbrace$ and then approximate the outer expectation $\mathbb{E}_\mathbb{X}$ with its emprical mean over $n$ data points. If we use $\mu({1, x})$ and

$\mu({0, x})$ to denote the two conditional expectations, we can estimate ATE as

\begin{equation} \hat{\tau}=\frac{1}{n}\sum_{\substack{i}}\lbrack{\hat{\mu}({1, x_i})-\hat{\mu}({0, x_i})}\rbrack \enspace \small{\text{(ATE)}} \end{equation}

To estimate CATE, we can select those observations that have $x_i=x$.

\begin{equation} \hat{\tau}({x})=\frac{1}{n_x}\sum_{\substack{i:x_i=x}}\lbrack{\hat{\mu}({1, x_i})-\hat{\mu}({0, x_i})}\rbrack \enspace \small{\text{(CATE)}} \end{equation}

The estimator in Eq.(13) for CATE is likely prone to positivity violation since positivity needs to be satisfied for all levels of $X$. For the estimators in Eq.(12) and (13), if we use the treatment as a covariate in a single model for the two outcome, then this estimator is called S-estimator since we only fit one model for $\mu$. Other names are conditional outcome model, standardization, and parametric G-formula. S-learner could bias towards zero since $X$ is usually high-dimensional (See Metalearners for estimating heterogeneous treatment effects using machine learning). T-learner fits two separate models for values of treatment to ensure that $T$ cannot be ignored.

Another estimator called “X-learner” is different from both S-learner and T-learner since it has three steps.

Fit the outcome models as the T-learner.
Fit two estimators for IATEs.
Combine the two estimators as a weighted average. Propensity score is suggested to work well.

The paper Nonparametric Estimation of Heterogeneous Treatment Effects: From Theory to Learning Algorithms) suggests that we can directly regress the following pseudo-outcome directly on covariates. The model can be used to directly estimate CATE.

\tilde{Y} = W(Y-\hat{\mu}({0, X})+(1-W)(\hat{\mu}({1, X})-Y)

4.2 Representation Learning Methods

A variety of deep learning based methods use neural networks to transform the covariates to a representation space where then the nuisance parameters are then modeled. This representation is usually learned by posing the desirable properties for causal inferencing as the constraints on the form of NN objective functions. For example, covariate balance is a desirable property that would give us conditional exchangeability. We therefore can learn a common representation for both outcome or even the treatment assignment. This representation is then shared and would have covariate balance.

Example NNs (TODO)

4.3 Tree-based Methods

4.4 Methods for Multi-valued and Continuous Treatment

4.4.1 Generalized Propensity Score

A common method for dealing with multi-valued and continuous treatment is the generalized propensity score. It is developed in The propensity score with continuous treatments.

Let $r(t, x)$ be the conditional density of the treatment given the covariates

$$ r(t, x)=f_{T|X}(t\mid{x}) $$

Then the generalized propensity score is $R=r(T, X)$

We can follow the procedure below to use the GPS.

In the first stage, we use a normal distribution for the treatment given the covariates. We may consider more general models such as mixtures of normals, or heteroskedastic normal distributions with the variance being a parametric function of the covariates. We estimate the parameter by maximum likelihood.

$$ T_i | X_i \sim \mathcal{N}(\beta_0 + \beta' X_i, \sigma^2) $$

$$ \hat{R}_i = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{1}{2\sigma^2} (T_i - \hat{\beta}_0 - \hat{\beta}' X_i)^2 \right). $$

In the second stage, we model the conditional expectation of $Y_i$ given $T_i$ and $R_i$ as a flexible function of its two arguments. For example, we can use the following quadratic equation.

$$ E[Y_i | T_i, R_i] = \alpha_0 + \alpha_1 T_i + \alpha_2 T_i^2 + \alpha_3 R_i + \alpha_4 R_i^2 + \alpha_5 T_i R_i. $$

Given the estimated parameter in the second stage, we estimate the average potential outcome at treatment level $t$ as

$$ \hat{E}[Y(t)] = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{\alpha}_0 + \hat{\alpha}_1 t + \hat{\alpha}_2 t^2 + \hat{\alpha}_3 \hat{r}(t, X_i) + \hat{\alpha}_4 \hat{r}(t, X_i)^2 + \hat{\alpha}_5 t \hat{r}(t, X_i) \right). $$

Another way of using the GPS is given in What-If book and the method is very similar to applying the regular propensity score to create a pseudo-population.

First step is estimate the stabilized weights as $SW^T = f(T)/f(T\mid{X})$. We could make parametric assumptions on both densities such as Gaussian distribution.
Compute the weights using the estimated densities then fit a weighted regression model for the outcome with covariates.