Chapter 3 Applying weights
3.1 Introduction
The Medical Expenditure Panel Survey (MEPS) is based on a complex survey design. Hence, it is necessary to apply survey weights to generate estimates that are representative of the United States (US) population. The weights take into account the stratification, clustering, sampling, and non-response, and are based on the Current Population Survey. Although you can perform descriptive and complex analyses without the weights, the results will describe only the sample: the point estimates will not be representative of the US population, and the standard errors will not reflect the survey design.
3.2 Types of weights
In MEPS, there are three survey design variables that are critical for most descriptive and multivariate analyses: person weight (perwtXXf), stratum (varstr), and cluster (varpsu). The XX is replaced by the last two digits of the survey year. For example, the person weight in 2020 is labelled as perwt20f.
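The naming rule for the person weight can be sketched in a few lines of base R. The helper name `person_weight_name` is hypothetical (it is not part of the MEPS package) and only illustrates how the two-digit year is placed into the variable name.

```r
## Hypothetical helper: build the lowercase person-weight variable name
## from a four-digit survey year (for example, 2020 -> "perwt20f").
person_weight_name <- function(survey_year) {
  two_digit_year <- survey_year %% 100          # keep the last two digits
  sprintf("perwt%02df", two_digit_year)         # pad with a leading zero if needed
}

person_weight_name(2020)
person_weight_name(2009)
```

The stratum and cluster variables (varstr and varpsu) do not carry a year suffix, so only the weight name changes from one file to the next.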
3.3 Loading the data
Let’s use the MEPS Full-Year Consolidated File from 2020. From our previous tutorial, you can load data using the read_MEPS function from the MEPS package. There are two methods that you can use to load data into R.
### Load the MEPS package
library("MEPS") ## You need to load the library every time you restart R
#### Method 1: Load data from AHRQ MEPS website by file name
hc2020 = read_MEPS(file = "h224")
#### Method 2: Load data from AHRQ MEPS website by year and file type
hc2020 = read_MEPS(year = 2020, type = "FYC")
## Change column names to lowercase
names(hc2020) <- tolower(names(hc2020))
Once the data has been loaded, we can look at how many variables there are.
## The number of columns represents the number of variables in the hc2020 dataframe.
ncol(hc2020)
## [1] 1451
We have over 1,400 variables. This is a very large dataframe. We can reduce this to a manageable size by keeping only the variables that are important. Let’s keep the unique patient identifier (dupersid), the survey design variables (perwt20f, varstr, and varpsu), the total expenditures (totexp20), sex (sex), and the poverty category (povcat20).
## Create a smaller dataframe
keep_hc2020 <- subset(hc2020, select = c(dupersid, perwt20f, varstr, varpsu, totexp20, sex, povcat20))
head(keep_hc2020)
## # A tibble: 6 × 7
## dupersid perwt20f varstr varpsu totexp20 sex povcat20
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
## 1 2320005101 8418. 2079 1 459 2 [2 FEMALE] 2 [2 NEAR POOR]
## 2 2320005102 5200. 2079 1 564 1 [1 MALE] 2 [2 NEAR POOR]
## 3 2320006101 2140. 2028 1 140 2 [2 FEMALE] 3 [3 LOW INCOME]
## 4 2320006102 2216. 2028 1 4673 1 [1 MALE] 1 [1 POOR/NEGATIVE]
## 5 2320006103 4157. 2028 1 410 1 [1 MALE] 3 [3 LOW INCOME]
## 6 2320012102 1961. 2069 2 2726 2 [2 FEMALE] 3 [3 LOW INCOME]
We can add labels to the sex variable where 1 = male and 2 = female.
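One way to attach these labels is to convert the variable to a factor. The sketch below uses a small hypothetical dataframe (`toy`) rather than the real MEPS file; the label text is chosen to match the "1 - Male" and "2 - Female" values that appear in the output later in this chapter.

```r
## Hypothetical toy data standing in for keep_hc2020
toy <- data.frame(sex = c(2, 1, 2, 1, 1, 2))

## Convert the numeric codes to a labelled factor
toy$sex <- factor(toy$sex,
                  levels = c(1, 2),
                  labels = c("1 - Male", "2 - Female"))

## Check the result
table(toy$sex)
```

In the real data the sex column is imported with value labels attached (the dbl+lbl type in the output above), so wrapping the column in as.numeric() before calling factor() may be a cautious way to work with the plain codes.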
3.4 Perform descriptive analysis
Now that we have a smaller dataframe with the variables of interest, let’s apply the survey weights to some descriptive analysis.
Suppose you were interested in the average total expenditures of the cohort. You will need to apply the survey weights to generate the mean and standard error. The survey package comes with the svydesign function, which combines the survey weights and design variables in the Full-Year Consolidated File data into a design object in preparation for analyses.
First, you will need to set the survey.lonely.psu option to adjust. A stratum that contains only one PSU (a “lonely” PSU) provides no within-stratum variation from which to estimate the variance. Single-PSU strata can arise in MEPS analyses, for example when you work with a subset of the data. With the adjust setting, the data from a single-PSU stratum are centered around the grand mean of the sample rather than the stratum mean, so the stratum still contributes to the variance estimate. Then, we use the svydesign function to generate a complex survey design object (which we will call mepsdsgn) for analysis by applying the survey weights.
## Load the "survey" package
library("survey")
## Apply the survey weights to the dataframe using the svydesign function
options(survey.lonely.psu = 'adjust')
mepsdsgn = svydesign(
  id = ~varpsu,
  strata = ~varstr,
  weights = ~perwt20f,
  data = keep_hc2020,
  nest = TRUE)
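To see what each argument does without downloading the MEPS file, here is a minimal sketch on a hypothetical eight-row dataset. The dataframe `toy` and its values are invented for illustration. Note that the PSU numbers restart within each stratum, as varpsu does in the MEPS output above; this is why nest = TRUE is needed.

```r
library("survey")

## Hypothetical toy data: 2 strata, 2 PSUs per stratum, 2 people per PSU
toy <- data.frame(
  varstr = c(1, 1, 1, 1, 2, 2, 2, 2),      # stratum
  varpsu = c(1, 1, 2, 2, 1, 1, 2, 2),      # PSU, numbered within stratum
  perwt  = c(100, 150, 200, 120, 300, 250, 180, 220),  # person weight
  totexp = c(500, 0, 1200, 300, 4500, 800, 60, 2500)   # expenditures
)

options(survey.lonely.psu = "adjust")

## nest = TRUE tells svydesign that PSU ids are only unique within a stratum
toy_dsgn <- svydesign(
  id = ~varpsu,
  strata = ~varstr,
  weights = ~perwt,
  data = toy,
  nest = TRUE)

## Weighted mean and design-based standard error
toy_est <- svymean(~totexp, design = toy_dsgn)
toy_est
```

The point estimate from svymean matches the ordinary weighted mean of totexp; the design information (strata and PSUs) affects the standard error.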
Once the survey weights have been applied, we can use the survey functions to perform some descriptive analysis on the mepsdsgn data.
First, let’s see how many people in the US population our sample represents, by sex. We use the svytable function to generate the weighted counts for males and females. Adding these together will yield the weighted size of the US population.
## Weighted sample of the population stratified by sex
svytable(~sex, design = mepsdsgn)
## sex
## 1 - Male 2 - Female
## 160960989 167584308
Using the survey weights, there are 160,960,989 males and 167,584,308 females. In total, there are 328,545,297 weighted subjects in the mepsdsgn data.
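A weighted count is simply the sum of the person weights within each group. The base R sketch below illustrates the arithmetic on a hypothetical five-row dataset; the weights are made up and do not come from the MEPS file.

```r
## Hypothetical toy data: one row per respondent
toy <- data.frame(
  sex    = c("1 - Male", "2 - Female", "2 - Female", "1 - Male", "2 - Female"),
  weight = c(8418, 5200, 2140, 2216, 4157)
)

## Weighted count per group = sum of the weights in that group
weighted_n <- tapply(toy$weight, toy$sex, sum)
weighted_n

## Weighted total = sum of all the weights
sum(weighted_n)
```

Each respondent "stands in" for as many people as their weight indicates, which is why five rows here represent more than 22,000 people.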
Let’s move on and estimate the average total expenditures for the total sample.
## Estimate the weighted mean total expenditure for the sample
svymean(~totexp20, design = mepsdsgn)
## mean SE
## totexp20 6266.1 164.38
The svymean function generates the appropriate average and standard error (SE) of the total sample that is representative of the US population. In 2020, the average total expenditure was $6,266 (SE, $164).
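The point estimate from svymean is a weighted mean: each person's expenditure is multiplied by their weight, and the sum is divided by the total of the weights. The sketch below works through this in base R on hypothetical values (loosely modelled on the first few rows printed earlier, with rounded weights). It does not reproduce the standard error, which depends on the strata and PSUs.

```r
## Hypothetical toy data
toy <- data.frame(
  totexp = c(459, 564, 140, 4673, 410),
  weight = c(8418, 5200, 2140, 2216, 4157)
)

## Unweighted mean: describes only these five rows
unweighted_mean <- mean(toy$totexp)

## Weighted mean: sum(w * y) / sum(w)
weighted_mean <- sum(toy$weight * toy$totexp) / sum(toy$weight)

unweighted_mean
weighted_mean
```

The two values differ because respondents with larger weights count for more in the weighted mean.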
In our mepsdsgn data, we have sex, which is a binary variable. Let’s compare the average total expenditures between males and females in the MEPS Full-Year Consolidated data. To estimate the mean within each group, we’ll need to use the svyby function along with the svymean function.
## Estimate the weighted mean total expenditure for males and females
svyby(~totexp20, ~sex, mepsdsgn, svymean)
## sex totexp20 se
## 1 - Male 1 - Male 5861.278 243.5624
## 2 - Female 2 - Female 6654.998 205.0776
The average total expenditures for males and females are $5,861 (SE, $244) and $6,655 (SE, $205), respectively.
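The group means reported by svyby are weighted means computed within each level of the grouping variable. As a rough illustration of the point estimates only (not the standard errors), the base R sketch below splits a hypothetical dataset by sex and computes a weighted mean in each piece. The data values are invented.

```r
## Hypothetical toy data
toy <- data.frame(
  sex    = c("1 - Male", "2 - Female", "2 - Female",
             "1 - Male", "2 - Female", "1 - Male"),
  totexp = c(564, 459, 140, 4673, 2726, 410),
  weight = c(5200, 8418, 2140, 2216, 1961, 4157)
)

## Split by sex, then compute the weighted mean within each group
by_sex <- sapply(split(toy, toy$sex),
                 function(g) weighted.mean(g$totexp, g$weight))
by_sex
```

For real analyses, svyby remains the better choice because it also carries the design information needed for the standard errors.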
We can perform crosstabulations with the svytable function. Let’s look at the distribution of males and females across various poverty categories. In the MEPS codebook, the poverty categories are grouped as: 1 = Poor/Negative, 2 = Near Poor, 3 = Low Income, 4 = Middle Income, and 5 = High Income.
## Crosstab sex and poverty category
svytable(~sex + povcat20, design = mepsdsgn)
## povcat20
## sex 1 2 3 4 5
## 1 - Male 16644995 5955576 18910001 45083571 74366846
## 2 - Female 21002826 6672752 21753502 47750327 70404901
To generate the proportions, you will need to use prop.table. We add the margin = 1 option to calculate row proportions, so that the poverty categories within each sex sum to 1.
prop.table(svytable(~sex + povcat20, design = mepsdsgn), margin = 1) ### margin = 1 divides each cell by its row total.
## povcat20
## sex 1 2 3 4 5
## 1 - Male 0.10341012 0.03700012 0.11748189 0.28009005 0.46201783
## 2 - Female 0.12532693 0.03981729 0.12980632 0.28493316 0.42011631
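The margin argument is easy to mix up, so here is a small base R sketch on a hypothetical 2 x 2 table of counts. With margin = 1 each row sums to 1; with margin = 2 each column sums to 1.

```r
## Hypothetical counts: rows are sex, columns are two poverty categories
counts <- matrix(c(10, 20,
                   30, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Male", "Female"),
                                 povcat = c("Poor", "High")))

## margin = 1: divide each cell by its row total (row proportions)
row_prop <- prop.table(counts, margin = 1)
rowSums(row_prop)

## margin = 2: divide each cell by its column total (column proportions)
col_prop <- prop.table(counts, margin = 2)
colSums(col_prop)
```

In the MEPS output above, each row (sex) sums to 1, which confirms that margin = 1 gives the distribution of poverty categories within each sex.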
We can combine these into a contingency table using the tbl_svysummary function from the gtsummary package. We will also use the tidyverse package to manipulate the data more easily.
## Load libraries
library("tidyverse")
library("gtsummary")
## Contingency table (crosstabulations between sex and poverty category)
mepsdsgn %>%
  tbl_svysummary(by = sex, percent = "column", include = c(povcat20))
Characteristic | 1 - Male, N = 160,960,989¹ | 2 - Female, N = 167,584,308¹ |
---|---|---|
FAMILY INC AS % OF POVERTY LINE - CATEGORICAL | | |
1 | 16,644,995 (10%) | 21,002,826 (13%) |
2 | 5,955,576 (3.7%) | 6,672,752 (4.0%) |
3 | 18,910,001 (12%) | 21,753,502 (13%) |
4 | 45,083,571 (28%) | 47,750,327 (28%) |
5 | 74,366,846 (46%) | 70,404,901 (42%) |
¹ n (%)
Based on these weighted estimates, a larger share of males than females is in the High Income category (46% versus 42%).
3.5 Conclusions
The MEPS data uses weights to generate estimates that are reflective of the US population. The survey package in R allows us to apply these weights using the svydesign function, which requires us to enter the person weight, stratum, and cluster variables. Once these are applied, we can use the suite of functions from the survey package to perform descriptive analysis on the population. The svymean function generates the population average and the svytable function generates the population frequencies. Having a good understanding of how these weights are used with MEPS data will allow you to generate estimates of the population in your epidemiology work.
3.6 Acknowledgements
The survey package and its functions were developed by Thomas Lumley; the package is available on CRAN.
The gtsummary package and its instructions were developed by Daniel D. Sjoberg, Joseph Larmarange, Michael Curry, Jessica Lavery, Karissa Whiting, and Emily C. Zabor, and can be found on the package website.
This is a work in progress, and I expect to make updates in the future.