1 Introduction and Background
1.1 What is Causal Inference?
Causal inference is a field of study that focuses on identifying and estimating causal relationships between variables. It goes beyond correlation by establishing a cause-and-effect relationship. Causal inference methods often utilise counterfactual reasoning to estimate the causal effect of an exposure or treatment on an outcome. Such counterfactual reasoning is used, often unknowingly, every day. For example, if someone misses their bus and thinks, “If I had left home five minutes earlier, I wouldn’t have missed it,” they are engaging in counterfactual reasoning. In everyday life, policy-making, medicine, and business, understanding the size and nature of a causal effect is essential for making decisions and avoiding misleading conclusions. In this background chapter I discuss key ideas in causal inference such as the potential outcomes framework, common estimands, and assumptions.
1.2 Layout
Chapter 1 provides a foundational introduction to causal inference, which is essential for understanding the context of this project. The chapter is designed as a concise background for readers who may not be familiar with causal inference, giving them the foundation needed to follow the rest of the project.
Chapter 2 introduces the central focus of this project: the use of machine learning to estimate propensity scores. The chapter begins with a traditional introduction to propensity scores, explaining their role in balancing covariates between treatment and control groups to reduce estimator bias. This leads into a discussion of the limitations of conventional propensity score estimation methods, which often rely on logistic regression. These limitations motivate the use of machine learning algorithms for propensity score estimation. The chapter then provides a theoretical comparison of common machine learning algorithms such as random forests, bootstrap aggregation (bagging), and gradient boosting machines. The goal of this chapter is to give readers an intuitive understanding of how machine learning can enhance propensity score estimation, setting the stage for practical applications later in the project.
Chapter 3 presents a comprehensive tutorial for implementing machine learning techniques in the estimation of propensity scores. This chapter is highly practical, walking the reader through the software implementations available for estimating and assessing propensity scores. The National Supported Work (NSW) program dataset is used as a running example throughout the tutorial. This dataset is commonly used in causal inference studies due to its simplicity and well-documented treatment effect, making it an ideal candidate. By the end of this chapter, readers should be able to apply these methods to their own datasets, replicating the steps and analyses presented.
Chapter 4 provides a detailed replication study of Jena et al. (2012), a paper that examines the impact of fair trade certification on farmers’ livelihoods in Ethiopia. The chapter builds on the original findings of the aforementioned authors by applying machine learning-based propensity score methods. The replication study is designed to demonstrate the practical advantages of using machine learning for propensity score estimation in a real-world setting. Specifically, my replication re-examines the causal effect of certification on per capita income. This comparison highlights how machine learning can potentially lead to more accurate and reliable estimates of treatment effects.
Chapter 5 provides a comprehensive overview of the findings from this project. It synthesises the key insights gained from the theoretical discussions, the practical tutorial, and the replication study. The chapter also outlines potential avenues for future research, emphasising the importance of continued exploration of machine learning methods in causal inference.
Appendix A offers supplementary material that supports the main content of the project. This includes explanations of the datasets used, along with code examples of loading and manipulating the data in R. Additionally, Appendix B presents custom functions developed for this project, which are used to present results and facilitate analyses throughout.
1.3 Potential Outcomes Framework
The Potential Outcomes Framework, also known as the Rubin Causal Model, was introduced by Rubin (1974) and builds upon the work of Splawa-Neyman (1923). The framework dominates how researchers think about causal inference by formalising counterfactual reasoning. Rubin defines a causal effect as a comparison between two states of the world: for each individual, there are two potential outcomes, one if they receive the treatment and one if they do not. The causal effect is the difference between these two potential outcomes.
The framework is highly flexible and adaptive, extending beyond traditional notions of “treatment” in medical or experimental contexts. It can apply to any kind of intervention, exposure, or condition that could influence an outcome, whether a medical treatment, policy change, environmental exposure, or even an abstract event like a decision or natural occurrence. Philosophically, the framework aligns with a view of the world that considers reality through alternative scenarios or what-ifs.
Consider a binary treatment: let the treatment for an observation be a random variable, \(T\), with realisation \(t_i \in \{0,1\}\) under control and treatment respectively. The absence of treatment is referred to as the control state. Let \(Y_i(1)\) and \(Y_i(0)\) be the two potential outcomes for observation \(i\) under treatment and control. The individual treatment effect (ITE) is defined as the difference between the two potential outcomes:
\[ \tau_i=Y_i(1) - Y_i(0). \tag{1.1}\]
There is a clear problem: only the outcome under either treatment or control is observable. If our observations are on people, then it is logically impossible for an individual to simultaneously both receive and not receive the treatment. For example, if someone takes medication to relieve a headache and their headache improves, they can never know what would have happened had they not taken the medication. This leads to the commonly discussed fundamental problem of causal inference: it is impossible to observe both potential outcomes. A counterfactual, the counter to the observed outcome, can never be observed in practice.
Let the observed outcome for \(i\) be denoted \(y_i(1)\) and \(y_i(0)\) under treatment and control. Many causal inference methods involve finding or estimating a counterfactual to compare outcomes and solve some variation of Equation 1.1. Let the estimated potential outcomes for \(i\) be denoted \(\hat{y}_i(1)\) and \(\hat{y}_i(0)\) under treatment and control.
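The fundamental problem can be made concrete with a small simulation. The following is a minimal sketch in R (all names are illustrative): because the data are simulated, both potential outcomes are generated for every unit, so the true ITE from Equation 1.1 is known, yet once treatment is assigned only one outcome per unit would be observed in practice.

```r
set.seed(42)
n <- 1000

# Simulate both potential outcomes for every unit (possible only in simulation)
y0 <- rnorm(n, mean = 10, sd = 2)      # outcome under control, Y_i(0)
y1 <- y0 + rnorm(n, mean = 3, sd = 1)  # outcome under treatment, Y_i(1)
tau <- y1 - y0                         # true individual treatment effects (Eq. 1.1)

# Randomly assign treatment; a real dataset reveals only one potential outcome
t <- rbinom(n, size = 1, prob = 0.5)
y_obs <- ifelse(t == 1, y1, y0)

mean(tau)  # the true average effect, roughly 3; unknowable outside a simulation
```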
1.4 Estimands
In causal inference, there are multiple parameters of interest, each called an estimand. The preferred estimand depends on the motivating example, discipline, or intended interpretation of a result.
The most basic estimand is the average treatment effect (ATE), denoted \(\tau_{ATE}\), which is the average effect of the treatment across all individuals in the population, regardless of whether they receive the treatment or not. This can be written as:
\[ \begin{aligned} \text{ATE} &= \tau_{ATE} = E[\tau] \\ &= E[Y(1) - Y(0)] \\ &= E[Y(1)] - E[Y(0)]. \end{aligned} \tag{1.2}\]
Under certain conditions, such as a randomised control trial, Equation 1.2 can be estimated using the explicit equation:
\[ \widehat{\text{ATE}} = \hat{\tau}_{ATE} = \frac{1}{N_t} \sum_{i :\, t_i = 1} y_i - \frac{1}{N_c} \sum_{i :\, t_i = 0} y_i, \tag{1.3}\]
where \(N_t\) and \(N_c\) are the numbers of treated and control observations. Essentially, Equation 1.3 is just the difference in mean outcomes between the two groups.
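Continuing the simulated example from Section 1.3, Equation 1.3 is a one-line computation. This sketch assumes the vectors `y_obs` and `t` from the earlier code block.

```r
# Difference-in-means estimator of the ATE (Eq. 1.3)
ate_hat <- mean(y_obs[t == 1]) - mean(y_obs[t == 0])
ate_hat  # close to the true value of 3 because treatment was randomised
```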
The second parameter of interest is the average treatment effect on the treated (ATT), the difference (contrast) between the potential outcomes of those who actually receive the treatment. In other words, considering observations where \(t_i=1\), what is the effect of the treatment? This can be written as:
\[ \begin{aligned} \text{ATT} &= \tau_{ATT} = E[\tau \mid T = 1] \\ & = E[Y(1) - Y(0) \mid T = 1] \\ & = E[Y(1) \mid T = 1] - E[Y(0) \mid T = 1]. \end{aligned} \tag{1.4}\]
The final parameter is the average treatment effect on the control (ATC), which is similar to the ATT but applies to those who are actually under control. The ATC is the contrast between the two potential outcomes for individuals who are actually in the control group. This is also known as the average treatment effect on the untreated (ATU). It can be written as:
\[ \begin{aligned} \text{ATC} &= \tau_{ATC} = E[\tau \mid T = 0] \\ & = E[Y(1) - Y(0) \mid T = 0] \\ & = E[Y(1) \mid T = 0] - E[Y(0) \mid T = 0]. \end{aligned} \tag{1.5}\]
For the estimated ATT and ATC, no explicit expressions exist. Estimation is completed using G-methods to obtain contrasts of potential outcomes (see Naimi, Cole, and Kennedy 2017).
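In a simulation where both potential outcomes are known, the ATT and ATC can nonetheless be computed directly by averaging the true individual effects within each group, which is useful for checking estimators. A sketch, again reusing the simulated `tau` and `t` from Section 1.3:

```r
# True estimands computed from the simulated potential outcomes
att <- mean(tau[t == 1])  # average effect among the treated (Eq. 1.4)
atc <- mean(tau[t == 0])  # average effect among the controls (Eq. 1.5)
c(ATT = att, ATC = atc)   # under randomisation, both are close to the ATE
```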
1.5 Assumptions in Causal Inference
Assumptions are made for the potential outcomes framework to be logically coherent and for estimands to be identifiable. Firstly, independence must be assumed, meaning the potential outcomes are independent of \(T\). This assumption is also known as unconfoundedness, ignorability, or selection on observables, and means there is no confounding relationship between the treatment and the potential outcomes. This matters because confounding variables can create a spurious relationship between the treatment and the outcome, leading to biased estimates of the treatment’s effect. Hence, treatment assignment should be random, allowing an unbiased estimate. Mathematically, independence can be stated as:
\[ Y(1), Y(0) \perp \!\!\! \perp T. \tag{1.6}\]
Independence implies exchangeability, meaning the individuals in the treatment and control groups could be swapped and the potential outcomes would remain the same. A weaker assumption is conditional independence, which states that assignment into treatment is random conditional on some covariates \(X\):
\[ Y(1), Y(0) \perp \!\!\! \perp T\mid X. \tag{1.7}\]
This assumption requires that the relevant covariates are known and measured, which may not always hold. The independence assumption motivates the use of randomisation in experimental contexts, as randomisation guarantees independence by design. Chapter 2 discusses conditional independence and uses propensity scores to condition on covariates.
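A short simulation illustrates why independence matters. In this hedged sketch (all names are illustrative), treatment assignment depends on a covariate that also affects the outcome, so the naive difference in means is biased; conditioning on the covariate, here via simple regression adjustment as one of several options, recovers the true effect.

```r
set.seed(1)
n <- 5000
x <- rnorm(n)                        # confounder: affects both T and Y
t <- rbinom(n, 1, plogis(1.5 * x))   # treatment assignment depends on x
y <- 2 * t + 3 * x + rnorm(n)        # true treatment effect is 2

mean(y[t == 1]) - mean(y[t == 0])    # biased: confounded by x
coef(lm(y ~ t + x))["t"]             # adjusting for x recovers roughly 2
```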
A second assumption is positivity. This means that for each \(i\), the conditional probability when \(X=x\) of being in either the treatment or control group is strictly between \(0\) and \(1\). In other words, \(\Pr(T = 1 \mid X = x) > 0\) and \(\Pr(T = 0 \mid X = x) > 0\). This ensures that all observations have at least some chance of receiving either the treatment or control. If not, it is theoretically impossible to obtain both potential outcomes and so the treatment effect cannot be estimated.
Building on positivity, the third assumption is common support. This implies the treatment and control groups overlap in terms of their characteristics. Overlap is crucial because it ensures that for every person in the treatment group there are similar individuals in the control group, similar in terms of age, gender, income, and other factors. Mathematically, for all \(i\), if the conditional probability of being treated, \(\Pr(T = 1 \mid X = x)\), is near \(1\), and \(\Pr(T = 0 \mid X = x)\) is near \(0\), then there are no comparable cases and there is no common support. Without comparable cases, it is not possible to satisfy exchangeability and so treatment effect estimates are likely to be biased.
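Positivity and common support are often assessed empirically by estimating each unit's probability of treatment and comparing the distributions across groups. A minimal sketch using the confounded simulation above; logistic regression is one common choice here, and Chapter 2 treats propensity scores in detail.

```r
# Estimate the probability of treatment given x (a propensity score)
ps <- predict(glm(t ~ x, family = binomial), type = "response")

# Compare estimated probabilities across groups; little or no overlap
# between the two distributions signals a common support problem
summary(ps[t == 1])
summary(ps[t == 0])
range(ps)  # values very close to 0 or 1 suggest positivity violations
```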
The fourth assumption is consistency between the potential outcomes and the observed outcome. For every \(i\), the observed outcome under treatment equals the potential outcome under treatment, and the observed outcome under control equals the potential outcome under control. Mathematically, \(y_i(1)=Y_i(1)\) and \(y_i(0)=Y_i(0)\), which leads to a switching equation that defines \(y_i\) as a function of the potential outcomes:
\[ y_i = T_i Y_i(1)+(1 - T_{i})Y_i(0). \tag{1.8}\]
Notice the logic of this equation: when \(T_i=1\) then \(y_i =Y_i(1)\) as the second term becomes zero. Similarly, when \(T_i=0\) then \(y_i = Y_i(0)\) as the first term becomes zero.
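The switching equation translates directly into code. A minimal sketch, reusing the names `t`, `y1`, and `y0` from the simulation in Section 1.3:

```r
# Switching equation (Eq. 1.8): observe Y(1) if treated, Y(0) otherwise
y_obs <- t * y1 + (1 - t) * y0
all.equal(y_obs, ifelse(t == 1, y1, y0))  # TRUE: identical to selecting directly
```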
The final key assumption is called the stable unit treatment value assumption, or SUTVA. This is a complex way of stating that there is no interference between observations. More specifically, neither potential outcome is affected by the treatment status of any other individual. To borrow terminology from economics, there are no externalities or spillover effects from one observation’s treatment status to another observation’s potential outcomes.
In causal inference, especially when working with observational data, it is critical that these assumptions are considered. If they do not hold, any model, regardless of its modelling assumptions, will not have a causal interpretation. Unfortunately, no statistical test can confirm whether these causal assumptions hold, and thus researchers must understand the context and data generating process in which they operate.
