library(causaldata)
data("nsw_mixtape", package = "causaldata")
nsw_data <- as.data.frame(nsw_mixtape)
# Replace data_id with a simple row index
nsw_data$data_id <- seq_len(nrow(nsw_data))
# Recode nodegree (1 = no degree) as degree (1 = has a degree)
nsw_data$degree <- 1 - nsw_data$nodegree
nsw_data$nodegree <- NULL

Appendix A — Datasets
A.1 National Supported Work Data
The National Supported Work (NSW) Demonstration Job Training Program dataset originates from a large-scale social experiment conducted in the United States in the 1970s. The experiment aimed to evaluate the impact of job training on employment and earnings among disadvantaged groups, including ex-addicts, ex-offenders, youth dropouts, and long-term unemployed women. The data contains a range of covariates such as age, education, pre-treatment earnings, marital status, and race.
The study uses a randomized controlled trial (RCT) design, which is rare for jobs and employment data. Participants were randomly assigned to either a treatment group, which received job training, or a control group, which did not. “Job training” may have included, but was not limited to, temporary work programmes, highly supervised work, and peer support programmes. This randomisation is notable because it simplifies the calculation of a treatment effect: under random assignment, the difference in mean outcomes between the two groups is an unbiased estimate of the average treatment effect.
LaLonde (1986) first used the NSW dataset to compare experimental and non-experimental estimators of the treatment effect. His findings highlighted significant discrepancies between the two, underscoring the importance of randomization in estimating causal effects. This study has been widely cited and forms the basis for many discussions on the validity of non-experimental methods. Dehejia and Wahba (1999) later revisited LaLonde’s analysis and compared many different contemporary methods, with varying results.
For these reasons, the NSW dataset is commonly used in the literature as a toy dataset. It serves as a practical example for students learning about causal inference, allowing them to understand and apply different econometric methods.
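As a minimal sketch of why randomisation simplifies estimation, the block below computes a difference in mean outcomes between the treated and control groups. The helper name `diff_in_means` is hypothetical and introduced only for illustration. The sketch assumes that the `treat` (treatment indicator) and `re78` (1978 earnings) columns of `nsw_mixtape` are the relevant variables.

```r
# Difference in means: a simple estimator of the average treatment effect
# when treatment is randomly assigned.
# `diff_in_means` is a hypothetical helper name, not part of any package.
diff_in_means <- function(outcome, treated) {
  mean(outcome[treated == 1]) - mean(outcome[treated == 0])
}

# Apply it to the NSW data if the causaldata package is installed.
# Assumption: `treat` is the 0/1 treatment indicator and `re78` is the outcome.
if (requireNamespace("causaldata", quietly = TRUE)) {
  data("nsw_mixtape", package = "causaldata")
  nsw_ate <- diff_in_means(nsw_mixtape$re78, nsw_mixtape$treat)
  print(nsw_ate)
}
```

With non-experimental data the same calculation would generally be biased by differences between the groups, which is the contrast LaLonde (1986) examined.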
A.2 Coffee Data from Jena et al. (2012)
The data used in the study by Jena et al. (2012) focuses on small coffee farmers in Ethiopia. It includes a comprehensive survey of coffee-producing households, capturing various socioeconomic and agricultural variables. Key data points include household income, coffee production levels, prices received for coffee (both certified and non-certified), costs associated with certification, and access to markets. Additionally, the dataset encompasses demographic information such as household size, education levels, and access to resources like credit and extension services. This rich dataset allows for a detailed analysis of the impact of coffee certification on the livelihoods of these farmers, providing insights into both the benefits and challenges associated with certification programs.
The data is best accessed from Lampach and Morawetz (2016), where it is available in the supplementary information: https://www.tandfonline.com/doi/full/10.1080/00036846.2016.1153795. The data is also included in this project’s GitHub repository under datasets/.
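library(haven)
# Read the Stata file and remove Stata display formats from the columns
coffee_data <- read_dta("datasets/Jena_etAl_LampachMorawetz.dta")
coffee_data <- zap_formats(coffee_data)
# Drop rows 56, 84, and 156
coffee_data <- coffee_data[-c(56, 84, 156), ]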