Chapter 6 Interrupted time series analysis (ITSA) with R: A short tutorial
6.1 Introduction
Previously, I wrote a tutorial on how to perform an interrupted time series analysis (ITSA) in Stata
, which is located on my RPubs site (link). This had me thinking of how to perform an ITSA using R
. So, I decided to create a short tutorial on how to perform an ITSA on R
using the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS) dataset. I tried several ways to do this, and I think I’ve found a nice approach that leverages survey weights from the MEPS database. Although I’m writing this ITSA tutorial with this approach, I plan on researching other methods to perform an ITSA in R
so stay tuned.
This short tutorial will summarize how to perform an ITSA using a linear regression model with survey weights applied from the MEPS database. (Note: I previously wrote a tutorial on how to apply weights from MEPS data (link).
6.2 Motivating example - Total Healthcare Costs from 2016 to 2021 by sex
We’ll use the MEPS database to compare the trends in total healthcare costs between sex. We’ll assume that a policy or intervention was implemented in 2019, which may have changed the healthcare costs levels and trends between males and females.
I pooled data from 2016 to 2021 and adjusted the weights. I won’t go into details on how to create this dataset for this tutorial since I provided a detailed summary in a previous tutorial. For more details on how I created this dataset, please refer to my previous tutorial (link).
# To install MEPS package in R, you need to do a couple of things.
### Step 1: Install the "devtools" package.
#install.packages("devtools")
### Step 2: Install the "MEPS" package from the AHRQ MEPS GitHub site.
#devtools::install_github("e-mitchell/meps_r_pkg/MEPS")
### step 3: Load the MEPS package
library("MEPS") ## You need to load the library every time you restart R
### Step 4: Load the other libraries
library("survey")
library("foreign")
library("dplyr")
library("ggplot2")
library("questionr") # remotes::install_github("juba/questionr")
library("lspline") # devtools::install_github("mbojan/lspline", build_vignettes=TRUE)
library("ggeffects") # remotes::install_github("strengejacke/ggeffects")
library("margins")
# There are two ways to load data from AHRQ MEPS website:
#### Method 1: Load data from AHRQ MEPS website
= read_MEPS(file = "h233")
hc2021 = read_MEPS(file = "h224")
hc2020 = read_MEPS(file = "h216")
hc2019 = read_MEPS(file = "h209")
hc2018 = read_MEPS(file = "h201")
hc2017 = read_MEPS(file = "h192")
hc2016
#### Method 2: Load data from AHRQ MEPS website
= read_MEPS(year = 2021, type = "FYC")
hc2021 = read_MEPS(year = 2020, type = "FYC")
hc2020 = read_MEPS(year = 2019, type = "FYC")
hc2019 = read_MEPS(year = 2018, type = "FYC")
hc2018 = read_MEPS(year = 2017, type = "FYC")
hc2017 = read_MEPS(year = 2016, type = "FYC")
hc2016
## Change column names to lowercase
names(hc2021) <- tolower(names(hc2021))
names(hc2020) <- tolower(names(hc2020))
names(hc2019) <- tolower(names(hc2019))
names(hc2018) <- tolower(names(hc2018))
names(hc2017) <- tolower(names(hc2017))
names(hc2016) <- tolower(names(hc2016))
# We need the linkage file with the appropriate stratum of the primary sampling unit (STRA9621) and primary sampling unit (PSU9621). (Note: Each year, the linkage file sampling unit name changes)
= read_MEPS(type = "Pooled linkage")
linkage names(linkage) <- tolower(names(linkage)) # change variable name to lower case
# Select specific variables
### 2021
= hc2021 %>%
hc2021p rename(
perwt = perwt21f,
totexp = totexp21,
ertexp = ertexp21) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)$year <- 2021
hc2021p
### 2020
= hc2020 %>%
hc2020p rename(
perwt = perwt20f,
totexp = totexp20,
ertexp = ertexp20) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)$year <- 2020
hc2020p
### 2019
= hc2019 %>%
hc2019p rename(
perwt = perwt19f,
totexp = totexp19,
ertexp = ertexp19) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)$year <- 2019
hc2019p
### 2018
= hc2018 %>%
hc2018p rename(
perwt = perwt18f,
totexp = totexp18,
ertexp = ertexp18) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)$year <- 2018
hc2018p
### 2017
= hc2017 %>%
hc2017p rename(
perwt = perwt17f,
totexp = totexp17,
ertexp = ertexp17) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)$year <- 2017
hc2017p
### 2016
= hc2016 %>%
hc2016p rename(
perwt = perwt16f,
totexp = totexp16,
ertexp = ertexp16) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)$year <- 2016
hc2016p
# Merge data and adjust the person weight by 6 years
= bind_rows(hc2021p,
pool_data
hc2020p,
hc2019p,
hc2018p,
hc2017p,%>%
hc2016p) mutate(poolwt = perwt / 6)
# Add a new variable for the pre-post index year at 2019
$period[pool_data$year >= 2016 & pool_data$year < 2019] = 0
pool_data$period[pool_data$year >= 2019 & pool_data$year < 2022] = 1
pool_datahead(pool_data)
## # A tibble: 6 × 11
## dupersid panel varstr varpsu perwt sex totexp ertexp
## <chr> <dbl+lbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl> <dbl>
## 1 2320005101 23 [23 PANEL 23] 2079 1 6785. 2 [2 FEMALE] 4908 0
## 2 2320005102 23 [23 PANEL 23] 2079 1 6177. 1 [1 MALE] 21257 0
## 3 2320006101 23 [23 PANEL 23] 2028 1 1599. 2 [2 FEMALE] 827 0
## 4 2320006102 23 [23 PANEL 23] 2028 1 1649. 1 [1 MALE] 0 0
## 5 2320006103 23 [23 PANEL 23] 2028 1 2892. 1 [1 MALE] 0 0
## 6 2320012102 23 [23 PANEL 23] 2069 2 1273. 2 [2 FEMALE] 9813 0
## year poolwt period
## <dbl> <dbl> <dbl>
## 1 2021 1131. 1
## 2 2021 1029. 1
## 3 2021 267. 1
## 4 2021 275. 1
## 5 2021 482. 1
## 6 2021 212. 1
# Reduce the linkage file to only include dupersid, panel, stra9621, psu9621
= linkage %>%
linkage_file select(dupersid, panel, stra9621, psu9621)
# Merge link file with main data
= left_join(pool_data,
pool_data_linked
linkage_file, by = c("dupersid", "panel"))
6.3 Interrupted Time Series Analysis Design
In this ITSA motivating example, we are going to compare the trends in healthcare total costs between males and females before and after the hypothetical policy intervention, which was implemented in 2019. The ITSA design is illustrated in the figure below.
For this example, the ITSA model is denoted by the following:
\(Y_i = \beta_0 + \beta_1(T_i) + \beta_2(X_i) + \beta_3(T_i * X_i) + \beta_4(Sex_i) + \beta_5(Sex_i * T_i) + \beta_6(Sex_i * X_i) + \beta_7(Sex_i * T_i * X_i) + \epsilon_i\),
where \(Sex_i\) is a new variable that denotes the groups (e.g., sexes), \(\beta_4\) denotes the difference in the outcomes between the groups (e.g., sexes) at beginning of the study (\(T = 0\)), \(\beta_5\) denotes the difference in the slopes between the groups before the intervention, \(\beta_6\) denotes the difference in the level changes between the groups immediately after implementation of the intervention \(X\), and \(\beta_7\) denotes the difference in the slopes after and before the intervention between the groups (e.g., difference-in-differences).
6.4 Set up the survey options
In order to estimate the appropriate costs and standard errors, we will need to set up the survey options and survey design using the weights from MEPS. We will use the svydesign
function to load the primary sampling unit, individual weights, and strata, which we will then assign to an object called survey_design
. Please see the code below:
# Set the survey options
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)
# Define the survey design
= svydesign(
survey_design id = ~psu9621,
strata = ~stra9621,
weights = ~poolwt,
data = pool_data_linked,
nest = TRUE)
6.5 Descriptive analysis
Once that’s done, we can do some descriptive analysis.
Using the svytable
function, we can estimate the weighted-sample of males and females in the United States (US). In our MEPS dataset, there are 160,228,773 males and 166,678,038 females, which is representative of the non-institutionalized civilian US population.
svytable(~sex, design = survey_design)
## sex
## 1 2
## 160228772.9 166678038.1
Moreover, we can look at the survey-weighted total healthcare costs between males and females. For instance, in 2016, the average annual total healthcare costs for males and females were $4351 and $5633 per person, respectively.
<- svyby(~totexp, ~year+sex, survey_design, svymean)
df1 df1
## year sex totexp se
## 2016.1 2016 1 4351.797969 143.3686740
## 2017.1 2017 1 4792.911089 153.9208333
## 2018.1 2018 1 5488.304037 174.0676798
## 2019.1 2019 1 5747.014931 158.7998588
## 2020.1 2020 1 5861.277852 244.1047619
## 2021.1 2021 1 6250.655387 282.9494481
## 2016.2 2016 2 5632.649717 153.3156133
## 2017.2 2017 2 5797.182100 162.3600394
## 2018.2 2018 2 6614.338104 173.1029815
## 2019.2 2019 2 6736.551546 158.9384299
## 2020.2 2020 2 6654.997756 204.6000543
## 2021.2 2021 2 7597.159563 234.2313172
We can also look at the survey-weighted average cost before and after the intervention (before and after 2019). For instance, before the intervention the survey-weighted average total healthcare cost for males and females were $4880 and $6016, respectively. After the intervention, the survey-weighted average total healthcare costs for males and females were $5954 and $6997, respectively.
<- svyby(~totexp, ~sex+period, survey_design, svymean)
df2 df2
## sex period totexp se
## 1.0 1 0 4879.518399 103.6333733
## 2.0 2 0 6016.325527 109.4863008
## 1.1 1 1 5954.493236 166.0967456
## 2.1 2 1 6997.285661 138.6561924
We can visualize the trend of the survey-weighted total healthcare costs for males and females from 2016 to 2021 using the ggsurvey
function. From the plot, we can see that the female group had a higher total healthcare cost trend compared to males.
# Plot total expenditures over time (Load the "questionr" package to use "ggsurvey")
ggsurvey(survey_design) +
aes(x = year, y = totexp, group = factor(sex), color = factor(sex)) +
stat_summary(fun.y = "mean", geom = "line", size = 2)
6.6 Linear regression model
We can now construct our linear regression model using the ITSA framework. Because we have survey-weights that need to be applied, we will use the svyglm
function. We will have several interaction terms according to the ITSA formula presented above. The triple interaction term (factor(period):factor(sex):year
) denotes the difference-in-differences estimation. In other words, it is the difference in total healthcare costs between males and females before and after the intervention was implemented. Think of it as subtracting the difference before and after for males and the difference before and after for females. Thus, we are getting two types of differences or the difference-in-differences.
sex
denotes our grouping variableperiod
denotes the binary variable that denotes when the intervention was implemented (0 = before
,1 = after
)year
denotes the time variable in the model
# ITSA: Triple interaction approach
<- svyglm(totexp ~ factor(sex) + factor(period) + year + factor(period):factor(sex) + factor(sex):year + factor(period):factor(sex):year, design = survey_design) ## ITSA linear model
itsa1 <- confint(itsa1) ## generate the 95% CI
itsa1.confint
# "cbind" the coefficient output with 95% CI output
round(
cbind(
summary(itsa1, df.resid = degf(itsa1$survey.design))$coef,
confint(itsa1, df = degf(itsa1$survey.design))
2) ),
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1141706.66 227313.30 -5.02 0.00
## factor(sex)2 156608.67 280182.38 0.56 0.58
## factor(period)1 638235.37 399815.90 1.60 0.11
## year 568.46 112.71 5.04 0.00
## factor(sex)2:factor(period)1 -516623.61 486326.52 -1.06 0.29
## factor(sex)2:year -77.08 138.92 -0.55 0.58
## factor(sex)1:factor(period)1:year -316.27 198.07 -1.60 0.11
## factor(sex)2:factor(period)1:year -60.45 181.29 -0.33 0.74
## 2.5 % 97.5 %
## (Intercept) -1588358.77 -695054.56
## factor(sex)2 -393926.88 707144.22
## factor(period)1 -147370.28 1423841.03
## year 347.00 789.92
## factor(sex)2:factor(period)1 -1472215.58 438968.35
## factor(sex)2:year -350.04 195.88
## factor(sex)1:factor(period)1:year -705.47 72.93
## factor(sex)2:factor(period)1:year -416.68 295.78
6.7 Plot
We can plot our ITSA along with the counterfactuals. We create two new objects itsa1
and itsa2
, which contain the predicted values from our model. The first object itsa1
contains the predictive value with all of our terms including the triple interaction. The second object itsa2
contains only the counterfactual terms.
#############
# Plot
#############
### Plot: Part 1 - Generate the predictions
<- ggpredict(itsa1, terms = c("year", "sex", "period"), ci.level = 0.95) # Full model
plot.itsa1 <- ggpredict(itsa1, terms = c("year", "sex"), ci.level = 0.95) # Counterfactual
plot.itsa2
### Plot: Part 2 - Plot the fitted values
# Start with the scatter (average values at each year by group)
<- c("darkgreen", "navyblue") # Create color vector
colors
plot(df1$year, df1$totexp, pch = 19, cex = 1.5, lwd = 1,
col = colors[factor(df1$sex)],
xlab = "Time (year)",
ylab = "Average total healthcare costs ($)",
ylim = c(4000, 8000))
legend("topleft", title = "Groups",
legend = c("Male", "Female"),
fill = c("darkgreen", "navyblue"))
# Add the lines for the per-post index periods
with(subset(plot.itsa1, x <= 2019 & group == 1 & facet == 0), lines(x, predicted, col = "darkgreen", lwd = 2))
with(subset(plot.itsa1, x <= 2019 & group == 2 & facet == 0), lines(x, predicted, col = "navyblue", lwd = 2))
with(subset(plot.itsa1, x >= 2019 & group == 1 & facet == 1), lines(x, predicted, col = "darkgreen", lwd = 2))
with(subset(plot.itsa1, x >= 2019 & group == 2 & facet == 1), lines(x, predicted, col = "navyblue", lwd = 2))
# Line for the index period
abline(v = 2019, col = "red", lty = 2, lwd = 2)
# Counterfactual lines
with(subset(plot.itsa2, x >= 2019 & group == 1), lines(x, predicted, col = "darkgreen", lty = 2, lwd = 2))
with(subset(plot.itsa2, x >= 2019 & group == 2), lines(x, predicted, col = "navyblue", lty = 2, lwd = 2))
6.8 Parallel trends assumption
The most critical part of the ITSA results are the parallel trends assumption and the difference-in-differences estimator.
Let’s start with the parallel trends assumption. We need to test to see if the trends between males and females are similar before the implementation of the intervention, which we can find in our model output. Based on our findings (see factor(sex)2:year
coefficient), the slopes between the two groups (males and females) before the intervention were not statistically different (average difference = -$77 (95% CI: -$350, $196). This means that the slopes are not significantly different between males and females. See ITSA Figure below with annotations.
Once we are confident that the trends prior to the implementation of the intervention are not different, any changes we see after the intervention can be attributed to the intervention. This is an important condition for the difference-in-differences design.
6.9 Additional estimations
We can use the margins
command to estimate the partial derivatives of the total healthcare costs along different parts of the timeline.
For instance, the average difference in total healthcare costs for males and females before and after the intervention were $1020 and $777, respectively. I’ve summarized these results into a table below.
### Average difference total costs in the pre- and post-intervention periods.
<- margins(itsa1, type = "response", design = survey_design, at = (list(period = 0:1)), variables = "sex")
margins1 summary(margins1)
## factor period AME SE z p lower upper
## sex2 0.0000 1020.3870 251.0666 4.0642 0.0000 539.0557 1523.2188
## sex2 1.0000 777.3969 260.7194 2.9817 0.0029 241.4676 1263.4687
We could also estimate the slopes or the average annual change in the total healthcare costs for males and females after the intervention using the margins
commands. For instance, average annual change in total healthcare costs for males and females are $252 and $431, respectively.
### Slopes after the intervention
<- margins(itsa1, type = "response", design = survey_design, at = (list(period = 1, sex = (1:2))), variables = "year")
margins3 summary(margins3)
## factor period sex AME SE z p lower upper
## year 1.0000 1.0000 252.1903 151.9225 1.6600 0.0969 -45.5724 549.9530
## year 1.0000 2.0000 430.9319 134.2015 3.2111 0.0013 167.9019 693.9620
6.10 Issues with the current model
However, it is not possible to compare the slopes using the R
version of the margins
package. I have tried to search for a way to do this, but the margins
package for R
currently can’t perform this comparison. You can follow the discussion on the margins
GitHub page (link), which discusses some potential solution. However, I prefer to use Stata
for these comparisons. Fortunately, there is a way to make some important conclusions using a different method–the linear spline model.
6.11 Workaround using a linear spline model
We could get the difference in the level change immediately after implementation of the intervention at year == 2019
using the lspline
command. We can also get the difference-in-differences estimate with this approach.
# ITSA: Linear splines approach
### Create a knot at 2019
<- svyglm(totexp ~ factor(sex) + factor(period) + lspline(year, c(2019), marginal = TRUE) + factor(sex):lspline(year, c(2019), marginal = TRUE) + factor(sex):factor(period), design = survey_design)
spline1 <- confint(spline1)
spline1.confint
# Output with 95% CI
round(
cbind(
summary(spline1, df.resid = degf(spline1$survey.design))$coef,
confint(spline1, df = degf(spline1$survey.design))
2) ),
## Estimate Std. Error
## (Intercept) -1141706.66 227313.30
## factor(sex)2 156608.67 280182.38
## factor(period)1 -313.65 285.65
## lspline(year, c(2019), marginal = TRUE)1 568.46 112.71
## lspline(year, c(2019), marginal = TRUE)2 -316.27 198.07
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)1 -77.08 138.92
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)2 255.82 240.93
## factor(sex)2:factor(period)1 -118.53 370.05
## t value Pr(>|t|)
## (Intercept) -5.02 0.00
## factor(sex)2 0.56 0.58
## factor(period)1 -1.10 0.27
## lspline(year, c(2019), marginal = TRUE)1 5.04 0.00
## lspline(year, c(2019), marginal = TRUE)2 -1.60 0.11
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)1 -0.55 0.58
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)2 1.06 0.29
## factor(sex)2:factor(period)1 -0.32 0.75
## 2.5 % 97.5 %
## (Intercept) -1588358.77 -695054.56
## factor(sex)2 -393926.88 707144.22
## factor(period)1 -874.93 247.63
## lspline(year, c(2019), marginal = TRUE)1 347.00 789.92
## lspline(year, c(2019), marginal = TRUE)2 -705.47 72.93
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)1 -350.04 195.88
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)2 -217.59 729.23
## factor(sex)2:factor(period)1 -845.64 608.58
The linear spline model creates knots
that automatically replaces some of the interactions in our previous model. For example the model element factor(sex):lspline(year, c(2019)
replaces the triple interaction factor(period):factor(sex):year
from our previous model.
You’ll notice that some of the coefficient values are similar to our first model. For instance, in our first model, the year
coefficient was equal to $568, and in the linear splines model, the lspline(year, c(2019), marginal = TRUE)1
coefficient was equal to $568. This denotes the average change in annual total healthcare costs for males before the implementation of the intervention at year == 2019
.
However, you’ll also notice some differences. In the linear splines model, we have a couple of important coefficients that will help to complete our evaluation of the ITSA model.
6.12 Differences in level immediately after implementation of the intervention
One of the most important changes we are interested in is the immediate change after implementation of the intervention at year == 2019
. This is captured with the factor(sex)2:factor(period)1
coefficient, which has a value equal to -$119 (95% CI: -$846, $609). This tells me that the difference in level change at year == 2019
was lower for females than males; however, this difference was not statistically significant. See ITSA Figure below with annotations.
6.13 Difference-in-differences estimation
Lastly, we can identify the difference-in-differences estimate from the linear spline model output in the form of the factor(sex)2:lspline(year, c(2019), marginal = TRUE)2
coefficient, which has a value equal to $256 (95% CI: -$218, $729). This coefficient denotes that average annual change in total healthcare expenditure between males and females before and after implementation of the intervention. In other words, this is a difference in the slopes for males before and after the intervention minus the difference in the slopes for females before and after the intervention. Hence, the term difference-in-differences
. See ITSA Figure below with annotations.
6.14 Putting it all together
With the ITSA, we will summarize our findings into a table. These include the main coefficients from the ITSA model.
6.15 Conclusions
The ITSA model is a very useful design to generate causal interpretation as long as the assumptions hold. For instance, the parallel trends assumptions needs to be assessed. Without it, it will be difficult to argue that the intervention was the only factor that could have impacted the changes in the level and trends between the groups.
In this tutorial, I ran into an issue with the first ITSA model using interaction terms. It didn’t provide me with the necessary coefficients to determine the main coefficients for the ITSA. For instance, I had to use a linear spline model to identify the difference-in-differences estimate. Although the ITSA model with the triple interaction was useful in generating the plot, I could have started with the linear splines model and generated the ITSA plot afterwards with the triple interaction model.
At the end of the day, I was able to generate a summary table and a very nice figure of the ITSA. However, I will continue to learn more about other contrasting packages such as emmeans
and marginaleffects
.
Although the margins
package in R
doesn’t have the pwcompare
feature from Stata
, there are some potential workarounds that I’ll investigate further. Discussion about this can be found in the margins
GitHub issues forum (link)
6.16 Acknowledgements
As always, I am grateful to the R
community for producing wonderful tools for novices like me. I’ve learned so much reading documents, forums, and issues online that it would be impossible to credit everyone. But I would like to highlight a few superstars that I think deserve special recognition.
I am grateful to the author of the margins
package, Thomas J. Leeper. You can find his margins
package on his GitHub site
.
Additionally, the marginaleffects
package by Vincent Arel-Bundock has helped me to compare some of R
’s marginal effects commands with Stata
. You can find his GitHub site here.
The emmeans
package was developed by Russel V. Length, which I plan on learning more about on his GitHub site (link). Note: I found a nice overview of the emmeans
package by Ariel Muldoon that I encourage interested users to read (link).
The ggeffect
package was very useful in helping me develop some very nice plots, which you can learn more about on their GitHub site (link).
Lastly, I found this amazing site on ITSA by Chrissy H. Roberts here. Her site helped me to create the ITSA plot which is in this tutorial. I hope you get a chance to read her blog on how to conduct an ITSA in R. I think she does a much better job that I did with this tutorial.