Introduction
Cost data is difficult to model due to the right-skewed nature of the distribution and the large point mass at zero. Generalized linear models (GLM) allow flexibility when it comes to modeling data distributions that are skewed. A common solution to modeling cost data is the GLM gamma model, which can address the right-skewed nature of the distribution. However, there are limits to the GLM gamma model, particularly when it comes to subjects having zero costs. These are not easily resolved with the GLM gamma model that uses a log link. An alternative is to use the Tweedie GLM model framework where we can use the different combinations of the family and link function to handle the zero costs issue.
In R, when you try to use the GLM gamma model with log link on cost data, you may run into problems because the gamma model will not accept values that are non-positive. This means that cost values that are negative or zero will not run. Using the Tweedie GLM framework, we can get around the problem by using a Gamma family with an identity link. The link function in a Tweedie GLM framework is a power function. A power function = 0 is akin to a log link, and a power function = 1 is akin to an identity link.
Motivating example
We will use data from the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS) from 2020 to evaluate whether there are gender disparities in total healthcare expenditures across different poverty categories.
The objectives are:
1: Build a Tweedie GLM gamma model
2: Interpret the interaction term between
gender
andpovcat20
variables3: Estimate and interpret the marginal effects
4: Plot the marginal effects
Load libraries
There are several packages that we will need for this tutorial.
#### Load libraries
library("haven")
library("ResourceSelection") ## Package to perform the Hosmer-Lemeshow GOF test
library("survey")
library("MEPS")
library("prediction")
library("margins")
library("ggeffects")
library("sjPlot")
library("statmod")
Import data from MEPS
We will import that data from MEPS and create an object
data1
to perform our analysis.
hc2020 = read_MEPS(file = "h224")
names(hc2020) <- tolower(names(hc2020))
keep_hc2020 <- subset(hc2020,
select = c(dupersid, perwt20f, varstr, varpsu, totexp20, sex, povcat20))
head(keep_hc2020)
## # A tibble: 6 × 7
## dupersid perwt20f varstr varpsu totexp20 sex povcat20
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
## 1 2320005101 8418. 2079 1 459 2 [2 FEMALE] 2 [2 NEAR POOR]
## 2 2320005102 5200. 2079 1 564 1 [1 MALE] 2 [2 NEAR POOR]
## 3 2320006101 2140. 2028 1 140 2 [2 FEMALE] 3 [3 LOW INCOME]
## 4 2320006102 2216. 2028 1 4673 1 [1 MALE] 1 [1 POOR/NEGATIVE]
## 5 2320006103 4157. 2028 1 410 1 [1 MALE] 3 [3 LOW INCOME]
## 6 2320012102 1961. 2069 2 2726 2 [2 FEMALE] 3 [3 LOW INCOME]
data1 <- keep_hc2020
Factor variables
We will factor two variables that will be used in our example. These
variables are sex
and povcat20
. We will create
a new variable for sex called gender
that uses a binary
variable (0 = female, 1 = male). Poverty category has 5 levels
(1
, 2
, 3
, 4
, and
5
) that denotes Poor, Near Poor, Low Income, Middle Income,
and High Income.
levels(data1$povcat20)
format(data1$povcat20)
factor(data1$povcat20, levels = c(1, 2, 3, 4, 5))
data1$gender[data1$sex == 1] = 0
data1$gender[data1$sex == 2] = 1
table(data1$gender)
factor(data1$gender, levels = c(0, 1))
Apply survey weights
We next apply the survey weights to our data data1
.
## Apply the survey weights to the dataframe using the svydesign function
options(survey.lonely.psu = 'adjust')
mepsdsgn = svydesign(
id = ~varpsu,
strata = ~varstr,
weights = ~perwt20f,
data = data1,
nest = TRUE)
Tweedie GLM model
We can construct our model using the Tweedie GLM framework. We will
include an interaction term gender:factor(povcat20)
that
will provide us with various levels of change across the poverty
category groups and by gender. The var.power = 2
option
indicates that this is a gamma model. The link.power = 1
option indicates that we are using an identity link. (Note:
link.power = 0
indicates a log link.) More information
about the Tweedie GLM framework can be found on the R
CRAN site.
### Model 3 (Y ~ gender + povcat20 + gender*povcat20) - This includes an interaction term
model1 <- svyglm(totexp20 ~ gender + factor(povcat20) + gender:factor(povcat20),
mepsdsgn,
family = tweedie(var.power = 2, link.power = 1))
summary(model1)
##
## Call:
## svyglm(formula = totexp20 ~ gender + factor(povcat20) + gender:factor(povcat20),
## design = mepsdsgn, family = tweedie(var.power = 2, link.power = 1))
##
## Survey design:
## svydesign(id = ~varpsu, strata = ~varstr, weights = ~perwt20f,
## data = data1, nest = TRUE)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5282.4 438.6 12.043 <2e-16 ***
## gender 1107.5 496.0 2.233 0.0269 *
## factor(povcat20)2 2946.9 1776.4 1.659 0.0990 .
## factor(povcat20)3 665.7 638.6 1.042 0.2987
## factor(povcat20)4 -232.3 583.5 -0.398 0.6911
## factor(povcat20)5 988.5 581.1 1.701 0.0908 .
## gender:factor(povcat20)2 -2988.6 1858.5 -1.608 0.1097
## gender:factor(povcat20)3 -380.0 776.6 -0.489 0.6252
## gender:factor(povcat20)4 -153.7 711.1 -0.216 0.8291
## gender:factor(povcat20)5 -179.9 643.8 -0.279 0.7803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Tweedie family taken to be 11.70468)
##
## Number of Fisher Scoring iterations: 3
confint(model1)
## 2.5 % 97.5 %
## (Intercept) 4416.4224 6148.3436
## gender 128.1652 2086.7529
## factor(povcat20)2 -560.1474 6453.9665
## factor(povcat20)3 -595.0006 1926.3864
## factor(povcat20)4 -1384.3167 919.7847
## factor(povcat20)5 -158.7678 2135.7795
## gender:factor(povcat20)2 -6657.8448 680.6731
## gender:factor(povcat20)3 -1913.1583 1153.0879
## gender:factor(povcat20)4 -1557.6813 1250.2282
## gender:factor(povcat20)5 -1450.9863 1091.2267
We next perform the goodness of fit (GOF) tests to inspect if the model has issues with heteroscedastiticty or specification.
#### GOF test - Pearson correlation
predict.model1 <- prediction(model1,
type = "response",
calculate_se = TRUE)
res <- data1$totexp20 - predict.model1$fit
pwcorr <- cor.test(predict.model1$fit, res)
pwcorr
##
## Pearson's product-moment correlation
##
## data: predict.model1$fit and res
## t = -0.37514, df = 27803, p-value = 0.7076
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.014003542 0.009504577
## sample estimates:
## cor
## -0.002249793
### GOF test - Pregibon's link test
xb2 <- predict.model1$fit * predict.model1$fit ## Square the predicted values
linear.model <- glm(data1$totexp20 ~ predict.model1$fit + xb2,
family = gaussian(link = "identity"))
summary(linear.model)
##
## Call:
## glm(formula = data1$totexp20 ~ predict.model1$fit + xb2, family = gaussian(link = "identity"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7821 -6090 -5018 -1850 1655715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.722e+03 7.136e+03 -0.522 0.602
## predict.model1$fit 2.294e+00 2.279e+00 1.006 0.314
## xb2 -1.083e-04 1.807e-04 -0.599 0.549
##
## (Dispersion parameter for gaussian family taken to be 482941352)
##
## Null deviance: 1.3440e+13 on 27804 degrees of freedom
## Residual deviance: 1.3427e+13 on 27802 degrees of freedom
## AIC: 634884
##
## Number of Fisher Scoring iterations: 2
### GOF test - Modified Hosmer-Lemeshow test with 10 deciles
hoslem.test(data1$totexp20, predict.model1$fit, g = 10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: data1$totexp20, predict.model1$fit
## X-squared = -19.518, df = 8, p-value = 1
There does not appear to be any issues with heteroscedastiticty or misspecification.
Interpreting the marginal effects
To understand how marginal effects work, we should list out the output from the Tweedie GLM gamma model.
model1 %>%
tbl_regression(intercept = TRUE) %>%
bold_labels() %>%
bold_p(t = 0.05)
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
(Intercept) | 5,282 | 4,416, 6,148 | <0.001 |
gender | 1,107 | 128, 2,087 | 0.027 |
factor(povcat20) | |||
1 | — | — | |
2 | 2,947 | -560, 6,454 | 0.10 |
3 | 666 | -595, 1,926 | 0.3 |
4 | -232 | -1,384, 920 | 0.7 |
5 | 989 | -159, 2,136 | 0.091 |
gender * factor(povcat20) | |||
gender * 2 | -2,989 | -6,658, 681 | 0.11 |
gender * 3 | -380 | -1,913, 1,153 | 0.6 |
gender * 4 | -154 | -1,558, 1,250 | 0.8 |
gender * 5 | -180 | -1,451, 1,091 | 0.8 |
1 CI = Confidence Interval |
For the gender
variable, the beta coefficient indicates
that the difference in cost between male and female subjects is $1107
(95% CI: $128, $2087). However, we have to be careful with this
interpretation because the interaction term includes
gender
, which we can’t ignore.
Let’s estimate the total healthcare expenditure for a male and female with poverty category = 2 using the beta coefficients from the model output.
We start with writing out the two formulas for the total expenditures
for males and females. We fill in some of the parameters that we know.
For instance, gender = 1
is male and
gender = 0
is female. The \(\beta_0\) is the intercept, which cancels
out. Another parameter that cancels out is povcat20
.
Now, we can fill in the beta parameters. For \(\beta_1\), the value is 1107; for \(\beta_3\), the value is -2989. You can plug
this into the equation. Since we are interacting gender
with povcat20
, multiplying gender = 0
with
povcat20 = 2
results in zero. Therefore, we only carry over
the \(\beta_3\) of the interaction term
with gender = 1
, which is -2989.
Therefore, the only coefficients that matter are \(\beta_1\) and \(\beta_3\). Combining these coefficients will results in -$1882, the difference in total healthcare expenditures between males and females.
Interaction terms are a little tricky. We can interpret them in two
ways. For instance, the interaction term between gender
and
povcat20
can be interpreted as the difference in total
expenditures between males and females at poverty category = 2. Or we
can interpret this as the difference in total expenditure between males
(gender = 1
) who are in poverty category 2 versus poverty
category 1.
These two interpretation are both correct. It will be up to you to determine how you want to articulate this.
There is a nice and convenient method to estimate the marginal
effects using the margins
command in R. For this example,
we will use the at
option to indicate the stratification
for gender
. This will generate an output that will give us
the marginal effect (or the difference in total healthcare expenditure)
for each poverty category among male and female subjects.
### Marginal effects of gender on poverty
margins1 <- margins(model1, type = "response", design = mepsdsgn, at = list(gender = 0:1))
summary(margins1)
## factor gender AME SE z p lower upper
## gender 0.0000 822.8500 309.0355 2.6626 0.0078 191.2642 1402.6610
## gender 1.0000 822.8500 308.9258 2.6636 0.0077 191.4792 1402.4461
## povcat202 0.0000 2946.9095 1776.3933 1.6589 0.0971 -534.7573 6428.5764
## povcat202 1.0000 -41.6763 688.7793 -0.0605 0.9518 -1391.6588 1308.3062
## povcat203 0.0000 665.6929 638.5606 1.0425 0.2972 -585.8628 1917.2487
## povcat203 1.0000 285.6577 577.0414 0.4950 0.6206 -845.3227 1416.6381
## povcat204 0.0000 -232.2660 583.5314 -0.3980 0.6906 -1375.9665 911.4344
## povcat204 1.0000 -385.9925 454.4700 -0.8493 0.3957 -1276.7374 504.7524
## povcat205 0.0000 988.5059 581.1104 1.7011 0.0889 -150.4495 2127.4613
## povcat205 1.0000 808.6261 478.5696 1.6897 0.0911 -129.3531 1746.6052
The output from the margins
command include the
AME
or average marginal effect and the corresponding 95%
confidence intervals (CI).
For instance, the average marginal effect of
povcat20 = 2
versus povcat20 = 1
for females
gender = 0
is $2947 (95% CI: -$534, $6429). The average
marginal effect of povcat20 = 2
versus
povcat20 = 1
for males is -$42 (95% CI: -$1392, $1308).
You can also plot these marginal effects using the
plot_model
command. We set type = pred
because
we want the predicted healthcare expenditures, and include the two terms
that make up the interaction (povcat20
and
gender
).
### Plots the marginal effects
plot_model(model1, type = "pred", terms = c("povcat20", "gender"))
This plot is very helpful to see how different the total expenditures are between males and females at each poverty category.
In our example above, we looked at the marginal effect of males and
females with a povcat20 = 2
. We estimated that the
difference between them was -$1882. You can visualize it on this plot.
Notice the difference between the red dot and blue dot at
povcat20 = 2
is -$1882.
If you look at the 95% CIs, they overlap between the males and
females. Hence, visually, we do not see any statistically differences in
the marginal effects for each poverty category between males and
females. This aligns with the output from the marginal effects that we
generated using plot_model
.
Let’s tie all this together.
The differences in total healthcare expenditures between
povcat20 = 2
versus povcat20 = 1
is clearly
illustrated when you have the margins output next to the plot. Among the
females, the marginal effect (or difference in total healthcare
expenditures) of povcat20 = 2
versus
povcat20 = 1
is $2947. Among the males, the marginal effect
of povcat20 = 2
versus povcat20 = 1
is
-$42.
I created a final table that includes the crucial values from the
margins
output. But I like when the figure and table are
provided side by side.
Conclusions
Based on the Tweedie GLM gamma model, we do not see disparities in total healthcare expenditures between males and females across all poverty categories. We may need to look at other socioeconomic factors for potential disparities in total healthcare expenditures.
Acknowledgements
There were several online resources that I used to create this tutorial.
Daniel Ludeceke has a great website
on the margins
and plot_model
commands.
Christopher F. Kutz’s paper, “Tweedie distributions for fitting semicontinuous health care utilization cost data,” is a great resource for understanding the Tweedie GLM framework.
The statmod
package is how I was able to implement the
Tweedie GLM distribution, which is located here.
Thomas J. Leeper has a great website that provides a
wonderful tutorial on using the margins
command in R.
Work in progress
This is a work in progress, and I will likely update this in the future.
Disclaimer
This is for educational purposes only.