Tweedie GLM model in R for Cost Data

2023-05-10

Introduction

Cost data is difficult to model due to the right-skewed nature of the distribution and the large point mass at zero. Generalized linear models (GLM) allow flexibility when it comes to modeling data distributions that are skewed. A common solution to modeling cost data is the GLM gamma model, which can address the right-skewed nature of the distribution. However, there are limits to the GLM gamma model, particularly when it comes to subjects having zero costs. These are not easily resolved with the GLM gamma model that uses a log link. An alternative is to use the Tweedie GLM model framework where we can use the different combinations of the family and link function to handle the zero costs issue.

In R, when you try to use the GLM gamma model with log link on cost data, you may run into problems because the gamma model will not accept values that are non-positive. This means that cost values that are negative or zero will not run. Using the Tweedie GLM framework, we can get around the problem by using a Gamma family with an identity link. The link function in a Tweedie GLM framework is a power function. A power function = 0 is akin to a log link, and a power function = 1 is akin to an identity link.

Motivating example

We will use data from the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS) from 2020 to evaluate whether there are gender disparities in total healthcare expenditures across different poverty categories.

The objectives are:

1: Build a Tweedie GLM gamma model
2: Interpret the interaction term between gender and povcat20 variables
3: Estimate and interpret the marginal effects
4: Plot the marginal effects

Load libraries

There are several packages that we will need for this tutorial.

#### Load libraries
library("haven")
library("ResourceSelection")  ## Package to perform the Hosmer-Lemeshow GOF test
library("survey")
library("MEPS")
library("prediction")
library("margins")
library("ggeffects")
library("sjPlot")
library("statmod")

Import data from MEPS

We will import that data from MEPS and create an object data1 to perform our analysis.

hc2020 = read_MEPS(file = "h224")
names(hc2020) <- tolower(names(hc2020))

keep_hc2020 <- subset(hc2020, 
                      select = c(dupersid, perwt20f, varstr, varpsu, totexp20, sex, povcat20))
head(keep_hc2020)

## # A tibble: 6 × 7
##   dupersid   perwt20f varstr varpsu totexp20 sex          povcat20           
##   <chr>         <dbl>  <dbl>  <dbl>    <dbl> <dbl+lbl>    <dbl+lbl>          
## 1 2320005101    8418.   2079      1      459 2 [2 FEMALE] 2 [2 NEAR POOR]    
## 2 2320005102    5200.   2079      1      564 1 [1 MALE]   2 [2 NEAR POOR]    
## 3 2320006101    2140.   2028      1      140 2 [2 FEMALE] 3 [3 LOW INCOME]   
## 4 2320006102    2216.   2028      1     4673 1 [1 MALE]   1 [1 POOR/NEGATIVE]
## 5 2320006103    4157.   2028      1      410 1 [1 MALE]   3 [3 LOW INCOME]   
## 6 2320012102    1961.   2069      2     2726 2 [2 FEMALE] 3 [3 LOW INCOME]

data1 <- keep_hc2020

Factor variables

We will factor two variables that will be used in our example. These variables are sex and povcat20. We will create a new variable for sex called gender that uses a binary variable (0 = female, 1 = male). Poverty category has 5 levels (1, 2, 3, 4, and 5) that denotes Poor, Near Poor, Low Income, Middle Income, and High Income.

levels(data1$povcat20)
format(data1$povcat20)
factor(data1$povcat20, levels = c(1, 2, 3, 4, 5))

data1$gender[data1$sex == 1] = 0
data1$gender[data1$sex == 2] = 1
table(data1$gender)
factor(data1$gender, levels = c(0, 1))

Apply survey weights

We next apply the survey weights to our data data1.

## Apply the survey weights to the dataframe using the svydesign function
options(survey.lonely.psu = 'adjust')

mepsdsgn = svydesign(
  id = ~varpsu,
  strata = ~varstr,
  weights = ~perwt20f,
  data = data1,
  nest = TRUE)

Tweedie GLM model

We can construct our model using the Tweedie GLM framework. We will include an interaction term gender:factor(povcat20) that will provide us with various levels of change across the poverty category groups and by gender. The var.power = 2 option indicates that this is a gamma model. The link.power = 1 option indicates that we are using an identity link. (Note: link.power = 0 indicates a log link.) More information about the Tweedie GLM framework can be found on the R CRAN site.

### Model 3 (Y ~ gender + povcat20 + gender*povcat20) - This includes an interaction term
model1 <- svyglm(totexp20 ~ gender + factor(povcat20) + gender:factor(povcat20), 
                 mepsdsgn, 
                 family = tweedie(var.power = 2, link.power = 1))
summary(model1)

## 
## Call:
## svyglm(formula = totexp20 ~ gender + factor(povcat20) + gender:factor(povcat20), 
##     design = mepsdsgn, family = tweedie(var.power = 2, link.power = 1))
## 
## Survey design:
## svydesign(id = ~varpsu, strata = ~varstr, weights = ~perwt20f, 
##     data = data1, nest = TRUE)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5282.4      438.6  12.043   <2e-16 ***
## gender                     1107.5      496.0   2.233   0.0269 *  
## factor(povcat20)2          2946.9     1776.4   1.659   0.0990 .  
## factor(povcat20)3           665.7      638.6   1.042   0.2987    
## factor(povcat20)4          -232.3      583.5  -0.398   0.6911    
## factor(povcat20)5           988.5      581.1   1.701   0.0908 .  
## gender:factor(povcat20)2  -2988.6     1858.5  -1.608   0.1097    
## gender:factor(povcat20)3   -380.0      776.6  -0.489   0.6252    
## gender:factor(povcat20)4   -153.7      711.1  -0.216   0.8291    
## gender:factor(povcat20)5   -179.9      643.8  -0.279   0.7803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Tweedie family taken to be 11.70468)
## 
## Number of Fisher Scoring iterations: 3

confint(model1)

##                               2.5 %    97.5 %
## (Intercept)               4416.4224 6148.3436
## gender                     128.1652 2086.7529
## factor(povcat20)2         -560.1474 6453.9665
## factor(povcat20)3         -595.0006 1926.3864
## factor(povcat20)4        -1384.3167  919.7847
## factor(povcat20)5         -158.7678 2135.7795
## gender:factor(povcat20)2 -6657.8448  680.6731
## gender:factor(povcat20)3 -1913.1583 1153.0879
## gender:factor(povcat20)4 -1557.6813 1250.2282
## gender:factor(povcat20)5 -1450.9863 1091.2267

We next perform the goodness of fit (GOF) tests to inspect if the model has issues with heteroscedastiticty or specification.

#### GOF test - Pearson correlation
predict.model1 <- prediction(model1, 
                             type = "response", 
                             calculate_se = TRUE)
res <- data1$totexp20 - predict.model1$fit
pwcorr <- cor.test(predict.model1$fit, res)
pwcorr

## 
##  Pearson's product-moment correlation
## 
## data:  predict.model1$fit and res
## t = -0.37514, df = 27803, p-value = 0.7076
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.014003542  0.009504577
## sample estimates:
##          cor 
## -0.002249793

### GOF test - Pregibon's link test
xb2 <- predict.model1$fit * predict.model1$fit  ## Square the predicted values
linear.model <- glm(data1$totexp20 ~ predict.model1$fit + xb2, 
                    family = gaussian(link = "identity"))
summary(linear.model)

## 
## Call:
## glm(formula = data1$totexp20 ~ predict.model1$fit + xb2, family = gaussian(link = "identity"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
##   -7821    -6090    -5018    -1850  1655715  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -3.722e+03  7.136e+03  -0.522    0.602
## predict.model1$fit  2.294e+00  2.279e+00   1.006    0.314
## xb2                -1.083e-04  1.807e-04  -0.599    0.549
## 
## (Dispersion parameter for gaussian family taken to be 482941352)
## 
##     Null deviance: 1.3440e+13  on 27804  degrees of freedom
## Residual deviance: 1.3427e+13  on 27802  degrees of freedom
## AIC: 634884
## 
## Number of Fisher Scoring iterations: 2

### GOF test - Modified Hosmer-Lemeshow test with 10 deciles
hoslem.test(data1$totexp20, predict.model1$fit, g = 10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  data1$totexp20, predict.model1$fit
## X-squared = -19.518, df = 8, p-value = 1

There does not appear to be any issues with heteroscedastiticty or misspecification.

Interpreting the marginal effects

To understand how marginal effects work, we should list out the output from the Tweedie GLM gamma model.

model1 %>%
  tbl_regression(intercept = TRUE) %>%
  bold_labels() %>%
  bold_p(t = 0.05)

Characteristic	Beta	95% CI¹	p-value
(Intercept)	5,282	4,416, 6,148	<0.001
gender	1,107	128, 2,087	0.027
factor(povcat20)
1	—	—
2	2,947	-560, 6,454	0.10
3	666	-595, 1,926	0.3
4	-232	-1,384, 920	0.7
5	989	-159, 2,136	0.091
gender * factor(povcat20)
gender * 2	-2,989	-6,658, 681	0.11
gender * 3	-380	-1,913, 1,153	0.6
gender * 4	-154	-1,558, 1,250	0.8
gender * 5	-180	-1,451, 1,091	0.8
¹ CI = Confidence Interval

For the gender variable, the beta coefficient indicates that the difference in cost between male and female subjects is $1107 (95% CI: $128, $2087). However, we have to be careful with this interpretation because the interaction term includes gender, which we can’t ignore.

Let’s estimate the total healthcare expenditure for a male and female with poverty category = 2 using the beta coefficients from the model output.

We start with writing out the two formulas for the total expenditures for males and females. We fill in some of the parameters that we know. For instance, gender = 1 is male and gender = 0 is female. The $\beta_0$ is the intercept, which cancels out. Another parameter that cancels out is povcat20.

Now, we can fill in the beta parameters. For $\beta_1$, the value is 1107; for $\beta_3$, the value is -2989. You can plug this into the equation. Since we are interacting gender with povcat20, multiplying gender = 0 with povcat20 = 2 results in zero. Therefore, we only carry over the $\beta_3$ of the interaction term with gender = 1, which is -2989.

Therefore, the only coefficients that matter are $\beta_1$ and $\beta_3$. Combining these coefficients will results in -$1882, the difference in total healthcare expenditures between males and females.

Interaction terms are a little tricky. We can interpret them in two ways. For instance, the interaction term between gender and povcat20 can be interpreted as the difference in total expenditures between males and females at poverty category = 2. Or we can interpret this as the difference in total expenditure between males (gender = 1) who are in poverty category 2 versus poverty category 1.

These two interpretation are both correct. It will be up to you to determine how you want to articulate this.

There is a nice and convenient method to estimate the marginal effects using the margins command in R. For this example, we will use the at option to indicate the stratification for gender. This will generate an output that will give us the marginal effect (or the difference in total healthcare expenditure) for each poverty category among male and female subjects.

### Marginal effects of gender on poverty
margins1 <- margins(model1, type = "response", design = mepsdsgn, at = list(gender = 0:1))
summary(margins1)

##     factor gender       AME        SE       z      p      lower     upper
##     gender 0.0000  822.8500  309.0355  2.6626 0.0078   191.2642 1402.6610
##     gender 1.0000  822.8500  308.9258  2.6636 0.0077   191.4792 1402.4461
##  povcat202 0.0000 2946.9095 1776.3933  1.6589 0.0971  -534.7573 6428.5764
##  povcat202 1.0000  -41.6763  688.7793 -0.0605 0.9518 -1391.6588 1308.3062
##  povcat203 0.0000  665.6929  638.5606  1.0425 0.2972  -585.8628 1917.2487
##  povcat203 1.0000  285.6577  577.0414  0.4950 0.6206  -845.3227 1416.6381
##  povcat204 0.0000 -232.2660  583.5314 -0.3980 0.6906 -1375.9665  911.4344
##  povcat204 1.0000 -385.9925  454.4700 -0.8493 0.3957 -1276.7374  504.7524
##  povcat205 0.0000  988.5059  581.1104  1.7011 0.0889  -150.4495 2127.4613
##  povcat205 1.0000  808.6261  478.5696  1.6897 0.0911  -129.3531 1746.6052

The output from the margins command include the AME or average marginal effect and the corresponding 95% confidence intervals (CI).

For instance, the average marginal effect of povcat20 = 2 versus povcat20 = 1 for females gender = 0 is $2947 (95% CI: -$534, $6429). The average marginal effect of povcat20 = 2 versus povcat20 = 1 for males is -$42 (95% CI: -$1392, $1308).

You can also plot these marginal effects using the plot_model command. We set type = pred because we want the predicted healthcare expenditures, and include the two terms that make up the interaction (povcat20 and gender).

### Plots the marginal effects
plot_model(model1, type = "pred", terms = c("povcat20", "gender"))

This plot is very helpful to see how different the total expenditures are between males and females at each poverty category.

In our example above, we looked at the marginal effect of males and females with a povcat20 = 2. We estimated that the difference between them was -$1882. You can visualize it on this plot. Notice the difference between the red dot and blue dot at povcat20 = 2 is -$1882.

If you look at the 95% CIs, they overlap between the males and females. Hence, visually, we do not see any statistically differences in the marginal effects for each poverty category between males and females. This aligns with the output from the marginal effects that we generated using plot_model.

Let’s tie all this together.

The differences in total healthcare expenditures between povcat20 = 2 versus povcat20 = 1 is clearly illustrated when you have the margins output next to the plot. Among the females, the marginal effect (or difference in total healthcare expenditures) of povcat20 = 2 versus povcat20 = 1 is $2947. Among the males, the marginal effect of povcat20 = 2 versus povcat20 = 1 is -$42.

I created a final table that includes the crucial values from the margins output. But I like when the figure and table are provided side by side.

Conclusions

Based on the Tweedie GLM gamma model, we do not see disparities in total healthcare expenditures between males and females across all poverty categories. We may need to look at other socioeconomic factors for potential disparities in total healthcare expenditures.

Acknowledgements

There were several online resources that I used to create this tutorial.

Daniel Ludeceke has a great website on the margins and plot_model commands.

Christopher F. Kutz’s paper, “Tweedie distributions for fitting semicontinuous health care utilization cost data,” is a great resource for understanding the Tweedie GLM framework.

The statmod package is how I was able to implement the Tweedie GLM distribution, which is located here.

Thomas J. Leeper has a great website that provides a wonderful tutorial on using the margins command in R.

Work in progress

This is a work in progress, and I will likely update this in the future.

Disclaimer

This is for educational purposes only.