Introduction
Modeling cost in a sample is problematic because of the large number of subjects with zero costs and the long right tail of the distribution. Although an ordinary least squares (OLS) model can estimate the mean cost, it does a poor job of fitting the rest of the distribution, particularly when there is a large density of subjects with zero costs.
The two-part model takes into account the large point mass of subjects with zero costs and the skewed, non-normal shape of the cost distribution. The first part of the model estimates the probability that a subject has non-zero costs. The second part of the model fits the cost distribution conditional on the subject having non-zero costs. The structural expression of the two-part model is:
\[ E[Y | \mathbf{X}] = Pr(Y > 0 | \mathbf{X}) * E[Y | Y > 0, \mathbf{X}] \]
The first part \(Pr(Y > 0 | \mathbf{X})\) denotes the probability that a subject has non-zero costs given a set of variables \(\mathbf{X}\). The first part of the model can be either a logit or probit model.
The second part \(E[Y | Y > 0, \mathbf{X}]\) denotes the expected costs \(Y\) given that the subject has non-zero costs \(Y > 0\) and a set of variables \(\mathbf{X}\). The second part of the two-part model can be any regression model that fits the data. Hence, OLS, log-transformed OLS, and generalized linear models (GLMs) can be used in the second part.
Personally, I prefer the logit model for the first part and a GLM with a gamma distribution and log link for the second part. These specifications seem to consistently fit healthcare expenditure data well.
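To make the structure concrete, below is a minimal sketch of fitting the two parts by hand and combining them. It assumes a hypothetical data frame dat with a cost variable y and a covariate x (this is not the MEPS data used later, and the twopartm package used below wraps these steps for you).
## Minimal two-part model sketch (hypothetical data frame "dat" with cost "y" and covariate "x")
dat$any.cost <- as.numeric(dat$y > 0)                                          ## Indicator for non-zero costs
part1 <- glm(any.cost ~ x, data = dat, family = binomial(link = "logit"))      ## Pr(Y > 0 | X)
part2 <- glm(y ~ x, data = subset(dat, y > 0), family = Gamma(link = "log"))   ## E[Y | Y > 0, X]

p.nonzero <- predict(part1, newdata = dat, type = "response")  ## Predicted probability of any cost
e.cost    <- predict(part2, newdata = dat, type = "response")  ## Predicted cost among users
e.total   <- p.nonzero * e.cost                                ## E[Y | X] = Pr(Y > 0 | X) * E[Y | Y > 0, X]
mean(e.total)                                                  ## Average predicted cost for the sample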
For this tutorial, we will use R
to construct a two-part
model for healthcare expenditures in a sample of subjects with
hypertension from the Medical Expenditure Panel Survey.
Motivating example
To explore how to use the two-part model, we will use total healthcare expenditure among respondents with hypertension from the 2017 Medical Expenditure Panel Survey (MEPS) data. Specifically, we want to estimate the average healthcare expenditure among subjects with hypertension in the US in 2017. You can import the data from my GitHub site directly into R.
We will focus on adults (age >= 18 years old) with hypertension in the United States (US). To read the data, we will use the haven package, which will allow us to read a Stata *.dta file into R.
library("haven") ## Package to allow import of Stata *.dta file
#### Import data
data1 <- read_dta("https://github.com/mbounthavong/Cost-as-a-dependent-variable/blob/main/Data/limited_data.dta?raw=true")
#### Restrict data to adults age >= 18 years old
data1 <- data1[(data1$age17x >= 18), ]
There should be 7,872 subjects in the data after restricting the sample to adults with hypertension.
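As a quick check (this step is optional and not part of the original output), the sample size can be confirmed with:
nrow(data1) ## Expecting 7,872 subjects per the restriction above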
Next, we visualize the distribution of the healthcare expenditure using a histogram.
hist(data1$totexp17)
summary(data1$totexp17)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 953.8 3517.0 10625.1 10547.8 552898.0
The average healthcare expenditure is $10,625 per subject. Notice the large point mass of subjects with zero or near-zero costs, as well as the long right tail of the healthcare expenditure distribution, which is driven by a few subjects with very large healthcare expenditures.
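To see the point mass and the right tail more clearly, one option (a minimal sketch; this output is not part of the original tutorial) is to look at the share of subjects with zero costs and a log-scale histogram of the positive costs:
#### Share of zero-cost subjects and log-scale view of positive costs (optional check)
mean(data1$totexp17 == 0)                      ## Proportion of subjects with zero expenditures
hist(log(data1$totexp17[data1$totexp17 > 0]))  ## Positive expenditures on the log scale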
Two-part model
To construct the two-part model, you will need to install and load the twopartm package into R.
library("twopartm") ## Package to perform two-part models
The main function that will be used is tpm, which will allow you to build the first and second parts of the model. Generally, I use the same variables for the first and second parts of the two-part model, but some users may want to build a probability model for the first part that is different from the second part. To build the first part, you will need to include the formula for the formula_part1 option. To build the second part, you will need to include the formula for the formula_part2 option. Here is an example:
tp.model <- tpm(formula_part1 = totexp17 ~ age17x + sex, formula_part2 = totexp17 ~ age17x + sex + racev2x + hispanx, data = data1, link_part1 = "logit", family_part2 = Gamma(link = "log"))
For this exercise, we will use the same variables in both parts of the two-part model.
The first part can be either a logit or probit model. In this motivating example, we’ll use the logit model for the first part.
The second part is a gamma model because it captures the cost distribution, which ranges between zero and infinity; costs cannot be negative.
#### Two-part model
tpm.model1 <- tpm(totexp17 ~ age17x + sex + racev2x + hispanx + marry17x + povcat17, data = data1, link_part1 = "logit", family_part2 = Gamma(link = "log"))
summary(tpm.model1)
## $Firstpart.model
##
## Call:
## glm(formula = nonzero ~ age17x + sex + racev2x + hispanx + marry17x +
## povcat17, family = binomial(link = "logit"), data = data1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1834 0.1905 0.2623 0.3660 1.1588
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.100175 0.202444 -0.495 0.62072
## age17x 0.047851 0.003452 13.863 < 2e-16 ***
## sex 0.640344 0.105028 6.097 1.08e-09 ***
## racev2x -0.143391 0.048756 -2.941 0.00327 **
## hispanx -0.812953 0.110506 -7.357 1.89e-13 ***
## marry17x 0.018434 0.047655 0.387 0.69888
## povcat17 0.104063 0.034358 3.029 0.00246 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3466.2 on 7871 degrees of freedom
## Residual deviance: 3096.6 on 7865 degrees of freedom
## AIC: 3110.6
##
## Number of Fisher Scoring iterations: 6
##
##
## $Secondpart.model
##
## Call:
## glm(formula = totexp17 ~ age17x + sex + racev2x + hispanx + marry17x +
## povcat17, family = Gamma(link = "log"), data = data1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1913 -1.5775 -0.8690 -0.0207 13.2098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.657839 0.141326 61.261 < 2e-16 ***
## age17x 0.013905 0.001987 6.997 2.85e-12 ***
## sex 0.061793 0.057849 1.068 0.2855
## racev2x -0.009336 0.030848 -0.303 0.7622
## hispanx -0.186162 0.076170 -2.444 0.0145 *
## marry17x 0.015766 0.028082 0.561 0.5745
## povcat17 -0.089098 0.020212 -4.408 1.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 5.939428)
##
## Null deviance: 17205 on 7418 degrees of freedom
## Residual deviance: 16712 on 7412 degrees of freedom
## AIC: 150858
##
## Number of Fisher Scoring iterations: 9
The output will contain the results from the first and second parts of the model. However, this doesn’t provide us with the average total healthcare expenditure for a subject with hypertension in the US in 2017.
To get the average total healthcare expenditure for the sample, we will use the predict function. We'll estimate the average total healthcare expenditure and then the standard errors. We set the se.fit = TRUE option because we want the function to estimate the standard errors using the Delta method.
#### Generate the estimated total costs for each subject and summarize for the sample
predict1 <- predict(tpm.model1, se.fit = TRUE)
summary(predict1$fit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2797 8365 10419 10625 12600 20732
summary(predict1$se.fit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 345.0 542.7 692.3 770.7 936.7 2533.9
According to the results from the predict function, the average healthcare expenditure among adults with hypertension in the US in 2017 was $10,625.00 with a standard error of $770.70. We can generate the 95% confidence interval (CI) by using the following equations:
\[ LL = Mean - 1.96 * SE \] \[ UL = Mean + 1.96 * SE \] The lower limit (LL) and upper limit (UL) denote the range of the 95% CI.
predict1$ci <- 1.96*predict1$se.fit
predict1$ll <- predict1$fit - predict1$ci
predict1$ul <- predict1$fit + predict1$ci
summary(predict1$ll)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1834 7142 9022 9115 10921 17320
summary(predict1$ul)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3473 9575 11740 12136 14330 24479
Based on these calculations, the 95% CI is between $9,115 and $12,136.
We can perform goodness of fit (GOF) tests to see if we have the correct specification for the two-part model. There are three GOF tests that we can perform.
The Pearson correlation test checks for correlation between the predicted costs and the residuals. We do not want to find any correlation between the predictions and the residuals.
The Pregibon link test assesses whether the model is properly specified. We square the predicted estimates and then regress the raw total healthcare expenditures on the predicted estimates and their square. If the squared term is not associated with the raw total healthcare expenditures, then the model specification is considered adequate.
The modified Hosmer-Lemeshow test assesses whether the mean residuals are equal to zero across groups of the predicted estimates; I normally use 10 deciles of the predicted estimates for this test. If the test is not statistically significant, then the model is considered correctly specified. (A by-hand sketch of this test appears after the output below.)
#### GOF tests
res <- data1$totexp17 - predict1$fit ## Estimate the residuals
summary(res)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -20492.7 -8725.3 -5974.2 -0.2 -468.3 542082.9
### Pearson correlation
pwcorr <- cor.test(predict1$fit, res)
pwcorr
##
## Pearson's product-moment correlation
##
## data: predict1$fit and res
## t = -0.27607, df = 7870, p-value = 0.7825
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02520129 0.01898051
## sample estimates:
## cor
## -0.003111908
### Pregibon's link test
xb2 <- predict1$fit * predict1$fit ## Square the predicted values
linear.model <- glm(data1$totexp17 ~ predict1$fit + xb2, family = gaussian(link = "identity"))
summary(linear.model)
##
## Call:
## glm(formula = data1$totexp17 ~ predict1$fit + xb2, family = gaussian(link = "identity"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17864 -8937 -5961 -479 541759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.522e+03 2.432e+03 -1.448 0.147620
## predict1$fit 1.704e+00 4.424e-01 3.851 0.000118 ***
## xb2 -3.219e-05 1.925e-05 -1.672 0.094512 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 540891244)
##
## Null deviance: 4.3328e+12 on 7871 degrees of freedom
## Residual deviance: 4.2563e+12 on 7869 degrees of freedom
## AIC: 180641
##
## Number of Fisher Scoring iterations: 2
### Modified Hosmer-Lemeshow test with 10 deciles
library("ResourceSelection") ## Package to perform the Hosmer-Lemeshow GOF test
hoslem.test(data1$totexp17, predict1$fit, g = 10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: data1$totexp17, predict1$fit
## X-squared = -30.975, df = 8, p-value = 1
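The hoslem.test function above comes from the ResourceSelection package. As described earlier, the modified Hosmer-Lemeshow test can also be computed by hand: group the predicted values into deciles, regress the residuals on the decile indicators (without an intercept), and use the overall F-test to check whether the mean residual in every decile is zero. Here is a minimal sketch that reuses predict1$fit and res from the steps above:
#### Modified Hosmer-Lemeshow test by hand (minimal sketch using predict1$fit and res from above)
deciles <- cut(predict1$fit,
               breaks = quantile(predict1$fit, probs = seq(0, 1, 0.1)),
               include.lowest = TRUE)        ## Group predicted values into 10 deciles
hl.lm <- lm(res ~ deciles - 1)               ## No intercept: one mean residual per decile
summary(hl.lm)                               ## Overall F-test: are all decile mean residuals zero?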
According to the results of the GOF tests, there do not appear to be any issues of mis-specification with the two-part model. Hence, we can conclude that the two-part model correctly models the total healthcare expenditures for subjects with hypertension in the US in 2017.
Conclusion
The average healthcare expenditure among subjects with hypertension in the US in 2017 was $10,625 (95% CI: $9,115, $12,136). Based on the GOF tests, we can conclude that the two-part model appears to correctly fit the raw data for total healthcare expenditures. Interestingly, the two-part model package twopartm in R is based on the twopm package for Stata. When comparing the two packages, the mean healthcare expenditures were exactly the same. However, I noticed some differences in the 95% CIs generated by the R and Stata packages. I'm not sure why the standard errors were different. I tried looking at the documentation, but I was unable to find a satisfying answer. I will continue to work on this issue and provide updates when needed.
Acknowledgements
There are several papers that were helpful when developing this article.
Belotti et al. have a great paper on the twopm package for Stata that you can read here.
Deb and Norton have a fantastic paper in the Annual Review of Public Health titled “Modeling Health Care Expenditures and Use” that reviews the different models for cost data. You can find their paper here
Duan created the twopartm package for R, which you can learn more about here.
Work in progress
This is a work in progress, and I’ll likely make updates as I learn more.