Introduction
The Medical Expenditure Panel Survey (MEPS) is based on a complex survey design. Hence, it is necessary to apply survey weights to generate estimates that are representative of the United States (US) population. The weights take into account the stratification, clustering, sampling, and non-response based on the Current Population Survey. Although you can perform descriptive and complex analyses without the weights, the results will describe only the sample: the estimates will not be representative of the US population, and the standard errors will not reflect the survey design.
Types of weights
In MEPS, there are three types of weights that are critical for most descriptive and multivariate analyses: person weight (perwtXXf), stratum (varstr), and cluster (varpsu). The XX is replaced by the two-digit year of the survey. For example, the person weight in 2020 is labelled as perwt20f.
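Because only the two-digit year changes, the weight variable name can be built programmatically. The following is a minimal sketch that assumes the file follows the perwtXXf pattern described above; the helper name meps_weight_name is hypothetical and is not part of the MEPS package.

```r
## Hypothetical helper (not part of the MEPS package): build the person
## weight variable name for a survey year, assuming the perwtXXf pattern.
meps_weight_name <- function(year) {
  sprintf("perwt%02df", year %% 100)  # keep only the last two digits of the year
}

meps_weight_name(2020)
## [1] "perwt20f"
```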
Loading the data
Let’s use the MEPS Full-Year Consolidated File from 2020. From our previous tutorial, you can load data using the MEPS library function read_MEPS. There are two methods that you can use to load data into R.
### Load the MEPS package
library("MEPS") ## You need to load the library every time you restart R
#### Method 1: Load data from AHRQ MEPS website by file name
hc2020 = read_MEPS(file = "h224")
#### Method 2: Load data from AHRQ MEPS website by year and file type
hc2020 = read_MEPS(year = 2020, type = "FYC")
## Change column names to lowercase
names(hc2020) <- tolower(names(hc2020))
Once the data has been loaded, we can look at how many variables there are.
## The number of columns represents the number of variables in the hc2020 dataframe.
ncol(hc2020)
## [1] 1451
We have over 1400 variables. This is a very large dataframe. We can reduce this to a manageable size by keeping only the variables that are important. Let’s keep the unique patient identifier (dupersid), weights (perwt20f, varstr, and varpsu), the total expenditures (totexp20), sex (sex), and the poverty category (povcat20).
## Create a smaller dataframe
keep_hc2020 <- subset(hc2020, select = c(dupersid, perwt20f, varstr, varpsu, totexp20, sex, povcat20))
head(keep_hc2020)
## # A tibble: 6 × 7
## dupersid perwt20f varstr varpsu totexp20 sex povcat20
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
## 1 2320005101 8418. 2079 1 459 2 [2 FEMALE] 2 [2 NEAR POOR]
## 2 2320005102 5200. 2079 1 564 1 [1 MALE] 2 [2 NEAR POOR]
## 3 2320006101 2140. 2028 1 140 2 [2 FEMALE] 3 [3 LOW INCOME]
## 4 2320006102 2216. 2028 1 4673 1 [1 MALE] 1 [1 POOR/NEGATIVE]
## 5 2320006103 4157. 2028 1 410 1 [1 MALE] 3 [3 LOW INCOME]
## 6 2320012102 1961. 2069 2 2726 2 [2 FEMALE] 3 [3 LOW INCOME]
We can add labels to the sex variable where 1 = male and 2 = female.
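One possible way to add these labels is to convert the variable to a factor. This is only a sketch on a small made-up data frame: toy is a hypothetical stand-in for keep_hc2020, and the label text is chosen to match the output printed later in this tutorial. The as.numeric() call is included on the assumption that it strips any value labels attached to the imported column.

```r
## Toy stand-in for keep_hc2020 (made-up values, not MEPS data)
toy <- data.frame(sex = c(2, 1, 2, 1, 1, 2))

## Convert the numeric codes to a labelled factor
toy$sex <- factor(as.numeric(toy$sex),
                  levels = c(1, 2),
                  labels = c("1 - Male", "2 - Female"))

## Check the result
table(toy$sex)
```

With the real data, the same factor() call would be applied to keep_hc2020$sex before the survey design is created, so that the labels carry through to the survey functions.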
Perform descriptive analysis
Now that we have a smaller dataframe with the variables of interest, let’s apply the survey weights to some descriptive analysis.
Suppose you were interested in the average total expenditure of the cohort. You will need to apply the survey weights to generate the mean and standard error. The survey package comes with the svydesign function, which uses the survey weights in the Full-Year Consolidated File data and applies them to the cohort in preparation for analyses.
First, you will need to set the survey.lonely.psu option to adjust. With MEPS data, a stratum can end up with a single PSU (a “lonely” PSU). The variance within a stratum is estimated from the differences between each PSU’s total and the stratum’s average PSU total, which cannot be computed from one PSU. The adjust setting centers a single-PSU stratum around the grand mean rather than the stratum mean. Then, we use the svydesign function to generate a complex survey design dataset (which we will call mepsdsgn) for analysis by applying the survey weights.
## Load the "survey" package
library("survey")
## Apply the survey weights to the dataframe using the svydesign function
options(survey.lonely.psu = 'adjust')
mepsdsgn = svydesign(
id = ~varpsu,
strata = ~varstr,
weights = ~perwt20f,
data = keep_hc2020,
nest = TRUE)
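To see what the adjust setting is for, the sketch below builds a tiny made-up design (not MEPS data) in which stratum 2 contains a single PSU. All variable names and values are hypothetical. With survey.lonely.psu set to "adjust", the mean and its standard error can still be estimated.

```r
library(survey)

## Made-up data: stratum 1 has two PSUs, stratum 2 has only one ("lonely") PSU
toy <- data.frame(
  stratum = c(1, 1, 1, 1, 2, 2),
  psu     = c(1, 1, 2, 2, 1, 1),
  wt      = c(10, 10, 10, 10, 20, 20),
  y       = c(100, 200, 300, 400, 500, 700)
)

## Center the single-PSU stratum at the grand mean when estimating variance
options(survey.lonely.psu = "adjust")

toy_dsgn <- svydesign(id = ~psu, strata = ~stratum, weights = ~wt,
                      data = toy, nest = TRUE)

## Weighted mean of y and its design-based standard error
est <- svymean(~y, design = toy_dsgn)
est
```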
Once the survey weights have been applied, we can use the survey functions to perform some descriptive analysis on the mepsdsgn data.
First, let’s see how many people in the US population our sample represents, by sex. We use the svytable function to generate the weighted sample for males and females. Adding these together will yield the weighted sample of the US population.
## Weighted sample of the population stratified by sex
svytable(~sex, design = mepsdsgn)
## sex
## 1 - Male 2 - Female
## 160960989 167584308
Using the survey weights, there are 160,960,989 males and 167,584,308 females. In total, there are 328,545,297 weighted subjects in the mepsdsgn data.
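These weighted counts are the person weights summed within each group. A minimal base R sketch with made-up values (not MEPS data) shows the idea:

```r
## Made-up data: each row is one sampled person with a person weight
toy <- data.frame(
  sex = factor(c("Male", "Female", "Male", "Female")),
  wt  = c(1500, 2000, 2500, 3000)
)

## Weighted count per group = sum of the weights in that group
weighted_n <- tapply(toy$wt, toy$sex, sum)
weighted_n
## Female   Male
##   5000   4000

## Weighted total across both groups
sum(weighted_n)
## [1] 9000
```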
Let’s move on and estimate the average total expenditures for the total sample.
## Estimate the weighted mean total expenditure for the sample
svymean(~totexp20, design = mepsdsgn)
## mean SE
## totexp20 6266.1 164.38
The svymean function generates the appropriate average and standard error (SE) of the total sample that is representative of the US population. In 2020, the average total expenditure was $6,266 (SE, $164).
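The point estimate itself is a weighted mean, sum(w * y) / sum(w); the strata and PSUs affect only the standard error. The sketch below checks this with made-up values (not MEPS data) in base R:

```r
## Made-up expenditures (y) and person weights (w)
y <- c(100, 200, 300, 400)
w <- c(1500, 2000, 2500, 3000)

## Weighted mean computed by hand
manual <- sum(w * y) / sum(w)

## Same quantity using the base R helper
builtin <- weighted.mean(y, w)

all.equal(manual, builtin)
## [1] TRUE
```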
In our mepsdsgn data, we have sex, which is a binary variable. Let’s estimate the mean total expenditures for males and females in the MEPS Full-Year Consolidated data. To estimate the mean for each of the two groups, we’ll need to use the svyby function along with the svymean function.
## Estimate the weighted mean total expenditure for males and females
svyby(~totexp20, ~sex, mepsdsgn, svymean)
## sex totexp20 se
## 1 - Male 1 - Male 5861.278 243.5624
## 2 - Female 2 - Female 6654.998 205.0776
The average total expenditures for males and females are $5,861 (SE, $244) and $6,655 (SE, $205), respectively.
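If confidence intervals are wanted alongside the standard errors, confint() can be applied to the svyby result. The sketch below uses a small made-up design (not MEPS data); all names and values are hypothetical.

```r
library(survey)

## Made-up data: two strata, two PSUs per stratum, equal weights
toy <- data.frame(
  stratum = c(1, 1, 1, 1, 2, 2, 2, 2),
  psu     = c(1, 1, 2, 2, 1, 1, 2, 2),
  wt      = rep(100, 8),
  sex     = factor(rep(c("Male", "Female"), times = 4)),
  y       = c(100, 200, 300, 400, 500, 600, 700, 800)
)

toy_dsgn <- svydesign(id = ~psu, strata = ~stratum, weights = ~wt,
                      data = toy, nest = TRUE)

## Weighted mean of y within each sex
by_sex <- svyby(~y, ~sex, toy_dsgn, svymean)
by_sex

## 95% confidence intervals for the group means
confint(by_sex)
```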
We can perform crosstabulations with the svytable function. Let’s look at the distribution of males and females across various poverty categories. In the MEPS codebook, poverty categories are grouped as: 1 = Poor/Negative, 2 = Near Poor, 3 = Low Income, 4 = Middle Income, and 5 = High Income.
## Crosstab sex and poverty category
svytable(~sex + povcat20, design = mepsdsgn)
## povcat20
## sex 1 2 3 4 5
## 1 - Male 16644995 5955576 18910001 45083571 74366846
## 2 - Female 21002826 6672752 21753502 47750327 70404901
To generate the proportions, you will need to use prop.table. We add the margin = 1 option to calculate proportions within each row, so the poverty category proportions sum to 1 for each sex.
prop.table(svytable(~sex + povcat20, design = mepsdsgn), margin = 1) ### margin = 1 divides each cell by its row total.
## povcat20
## sex 1 2 3 4 5
## 1 - Male 0.10341012 0.03700012 0.11748189 0.28009005 0.46201783
## 2 - Female 0.12532693 0.03981729 0.12980632 0.28493316 0.42011631
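As a quick check of what the margin argument does, the sketch below applies prop.table to a small made-up table (not MEPS data). With margin = 1 each row sums to 1, and with margin = 2 each column sums to 1.

```r
## Made-up 2 x 2 table of counts (rows = sex, columns = income group)
tab <- matrix(c(10, 30, 20, 40), nrow = 2,
              dimnames = list(sex = c("Male", "Female"),
                              povcat = c("Low", "High")))

## margin = 1: divide each cell by its row total
by_row <- prop.table(tab, margin = 1)
rowSums(by_row)
##   Male Female
##      1      1

## margin = 2: divide each cell by its column total
by_col <- prop.table(tab, margin = 2)
colSums(by_col)
##  Low High
##    1    1
```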
We can combine these into a contingency table using the tbl_svysummary function from the gtsummary package. We will also use the tidyverse package to manipulate the data more easily.
## Load libraries
library("tidyverse")
library("gtsummary")
## Contingency table (crosstabulations between sex and poverty category)
mepsdsgn %>%
tbl_svysummary(by = sex, percent = "column", include = c(povcat20))
Characteristic | 1 - Male, N = 160,960,989¹ | 2 - Female, N = 167,584,308¹ |
---|---|---|
FAMILY INC AS % OF POVERTY LINE - CATEGORICAL | | |
1 | 16,644,995 (10%) | 21,002,826 (13%) |
2 | 5,955,576 (3.7%) | 6,672,752 (4.0%) |
3 | 18,910,001 (12%) | 21,753,502 (13%) |
4 | 45,083,571 (28%) | 47,750,327 (28%) |
5 | 74,366,846 (46%) | 70,404,901 (42%) |
¹ n (%)
Based on these weighted sample numbers, a larger proportion of males than females are in the High Income category (46% versus 42%).
Conclusions
The MEPS data uses weights to generate estimates that are reflective of the US population. The survey package for R allows us to apply these weights using the svydesign function, which requires us to enter the person weight, stratum, and cluster variables. Once these are applied, we can use the suite of functions from the survey package to perform descriptive analysis on the population. The svymean function generates the population average and the svytable function generates the population frequencies. Having a good understanding of how these weights are used with MEPS data will allow you to generate estimates of the population in your epidemiology work.
Acknowledgements
The survey package and functions were developed by Thomas Lumley and are available on CRAN.
The gtsummary package and instructions were developed by Daniel D. Sjoberg, Joseph Larmarange, Michael Curry, Jessica Lavery, Karissa Whiting, and Emily C. Zabor, and can be found at their website.
This is a work in progress, and I expect to make updates in the future.