Stata is a statistical software the is used to perform statistical analysis. It is one of my favorite statistical software, and I use it for many of my projects. Stata also has a powerful graphic editing tool that allows for a full range of customization. However, I like to keep all of these in the code so that I can have a convenient template to re-create my custom figures.
One figure that I generate often in Stata is the connected line plot,
whch uses the connected
command. Here is an example of what
that looks like. Notice how each point is connected by a line. Each
point represents the average price for each category of
rep78
. However, if I wanted to add the 95% confidence
interval error bars, I will need to do more coding.
To add 95% CI to the Stata connected
line plot, we will
use the Medical Expenditure
Panel Survey (MEPS) from the Agency for Healthcare Research and
Quality (AHRQ). I have already downloaded and cleaned the data from this
tutorial, which is available on my Dropbox
folder. Make sure to download the data and save it on the folder
where you are going to perform this exercise.
In this tutorial, we are interested in plotting the average total healthcare expenditures along with the 95% CI of responders from 2008 to 2019.
First, we will need to load the data into Stata. The data we will use
for this tutorial is adjusted_combined_data_post.dta
.
// Use the data from the following Dropbox folder:
* For Windows:
clear all
cd "C:\Users\mbounthavong\Dropbox\Marks blog\Stata in R Markdown"
use adjusted_combined_data_post.dta
C:\Users\mbounthavong\Dropbox\Marks blog\Stata in R Markdown
Let’s look at the number of subjects we have across the years. There are twelve years of data from 2008 - 2019.
. tab year, m
year | Freq. Percent Cum.
------------+-----------------------------------
2008 | 23,183 7.84 7.84
2009 | 26,008 8.79 16.63
2010 | 23,434 7.92 24.56
2011 | 25,226 8.53 33.09
2012 | 27,820 9.41 42.50
2013 | 26,316 8.90 51.39
2014 | 24,899 8.42 59.81
2015 | 25,577 8.65 68.46
2016 | 25,200 8.52 76.98
2017 | 23,529 7.96 84.94
2018 | 22,811 7.71 92.65
2019 | 21,722 7.35 100.00
------------+-----------------------------------
Total | 295,725 100.00
We can create a two-way line plot using the connected
command. We’ll plot he average total healthcare expenditure again the
years (2008-2019).
We estimate the mean total healthcare expenditure using the
egen
command by year. Then we use the
graph twoway connected
code to generate the two-way line
plot. Before you plot the figure, make sure that you sort by year.
Otherwise, you will be a spaghetti-like plot where Stata will try to
draw the line across the years if they are not sorted or in order.
. egen mean_expenditure = mean(totexp), by(year)
. sort year
. graph twoway connected mean_expenditure year, ///
> title("Total healthcare expenditures (2008-2019)") ///
> ytitle("Average total healthcare expenditure ($)") ///
> xtitle("Time (year)") ///
> xlab(2008(1)2019) ///
> ylab(, labsize(small) nogrid) ///
> graphregion(color(white)) ///
> bgcolor(white)
. graph export "connected1.png", replace
file connected1.png saved as PNG format
So far, our plot only illustrates the average total healthcare expenditure across time. But we want to plot the average total healthcare expenditures with 95% confidence intervals (CI). The 95% CI is important because it will give us some visual indicator of the variance surrounding each value.
We will need to estimate the 95% CI for each value. We can get the
95% CI for each value by using the ci means
command.
. egen mean_expenditure = mean(totexp), by(year)
. ci means mean_expenditure
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
mean_expen~e | 295,725 5757.172 1.967669 5753.315 5761.028
This gives us the 95% CI for the pooled total healthcare expenditure.
If we want to get the 95% CI for each year, we need to specify the year.
We use the forvalues
loop to cycle through the years and
estimate the mean, standard error, and 95% CI for the total healthcare
expenditure at each year (2008-2019).
. forvalues i = 2008/2019 {
2. ci mean totexp if year == `i'
3. }
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,183 4777.358 80.79093 4619.002 4935.713
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 26,008 5343.664 89.85267 5167.548 5519.78
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,434 5043.94 91.67762 4864.246 5223.634
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,226 5029.757 140.6383 4754.098 5305.416
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 27,820 4688.955 89.59223 4513.35 4864.56
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 26,316 5007.205 89.70101 4831.386 5183.024
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 24,899 5477.92 101.46 5279.053 5676.788
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,577 5856.111 112.1399 5636.31 6075.911
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,200 5886.41 101.2013 5688.049 6084.77
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,529 6645.628 122.1442 6406.217 6885.038
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 22,811 7749.712 134.2076 7486.656 8012.768
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 21,722 8187.724 142.8591 7907.709 8467.738
We start by adding two new variables: high
and
low
, which are the high and low ends of the 95% CI. Then we
estimate the 95% CI for the total healthcare expenditure at each year.
The data are stored in Stata and can be retrieved using the
r()
syntax. For example, if I want the lower bound of the
95% CI in 2009, then I write
replace low = r(lb) if year == 2009
.
To make this process more efficient, we use the
forvalues
loop for each year (2008-2019).
. // Generate the upper and low bound of the 95% CI variable
. gen high = .
(295,725 missing values generated)
. gen low = .
(295,725 missing values generated)
.
. // Loop commmand
. forvalues i = 2008/2019 {
2. ci mean totexp if year == `i'
3. replace high = r(ub) if year == `i'
4. replace low = r(lb) if year == `i'
5. }
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,183 4777.358 80.79093 4619.002 4935.713
(23,183 real changes made)
(23,183 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 26,008 5343.664 89.85267 5167.548 5519.78
(26,008 real changes made)
(26,008 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,434 5043.94 91.67762 4864.246 5223.634
(23,434 real changes made)
(23,434 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,226 5029.757 140.6383 4754.098 5305.416
(25,226 real changes made)
(25,226 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 27,820 4688.955 89.59223 4513.35 4864.56
(27,820 real changes made)
(27,820 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 26,316 5007.205 89.70101 4831.386 5183.024
(26,316 real changes made)
(26,316 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 24,899 5477.92 101.46 5279.053 5676.788
(24,899 real changes made)
(24,899 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,577 5856.111 112.1399 5636.31 6075.911
(25,577 real changes made)
(25,577 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,200 5886.41 101.2013 5688.049 6084.77
(25,200 real changes made)
(25,200 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,529 6645.628 122.1442 6406.217 6885.038
(23,529 real changes made)
(23,529 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 22,811 7749.712 134.2076 7486.656 8012.768
(22,811 real changes made)
(22,811 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 21,722 8187.724 142.8591 7907.709 8467.738
(21,722 real changes made)
(21,722 real changes made)
Now, that we have the 95% CI for total healthcare expenditure at each
year, we can plot these using the rcap
command. This allows
us to create error bars using the variables we generated for the upper
high
and lower low
bounds of the 95% CI.
// Generate the average total healthcare expenditure by year
egen mean_expenditure = mean(totexp), by(year)
// Generate the upper and low bound of the 95% CI variable
gen high = .
gen low = .
// Loop commmand
forvalues i = 2008/2019 {
ci mean totexp if year == `i'
replace high = r(ub) if year == `i'
replace low = r(lb) if year == `i'
}
// Sort by year and generate the two-way line plot
sort year
graph twoway (rcap low high year) ///
(connected mean_expenditure year, color(navy) msize(small) ///
title("Total healthcare expenditures (2008-2019)") ///
ytitle("Average total healthcare expenditure ($)") ///
xtitle("Time (year)") ///
xlab(2008(1)2019, labsize(vsmall)) ///
ylab(, labsize(vsmall) nogrid) ///
graphregion(color(white)) ///
bgcolor(white) ///
color(navy))
graph export "connected_errorbars.png", replace
(295,725 missing values generated)
(295,725 missing values generated)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,183 4777.358 80.79093 4619.002 4935.713
(23,183 real changes made)
(23,183 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 26,008 5343.664 89.85267 5167.548 5519.78
(26,008 real changes made)
(26,008 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,434 5043.94 91.67762 4864.246 5223.634
(23,434 real changes made)
(23,434 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,226 5029.757 140.6383 4754.098 5305.416
(25,226 real changes made)
(25,226 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 27,820 4688.955 89.59223 4513.35 4864.56
(27,820 real changes made)
(27,820 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 26,316 5007.205 89.70101 4831.386 5183.024
(26,316 real changes made)
(26,316 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 24,899 5477.92 101.46 5279.053 5676.788
(24,899 real changes made)
(24,899 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,577 5856.111 112.1399 5636.31 6075.911
(25,577 real changes made)
(25,577 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 25,200 5886.41 101.2013 5688.049 6084.77
(25,200 real changes made)
(25,200 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 23,529 6645.628 122.1442 6406.217 6885.038
(23,529 real changes made)
(23,529 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 22,811 7749.712 134.2076 7486.656 8012.768
(22,811 real changes made)
(22,811 real changes made)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
totexp | 21,722 8187.724 142.8591 7907.709 8467.738
(21,722 real changes made)
(21,722 real changes made)
file connected_errorbars.png saved as PNG format
Adding the 95% CI allows the reader to visualize the variance surrounding the mean total healthcare expenditure per year. This added dimension to the two-way line plot is useful for publication-quality images. However, these series of commands can be used for any continuous data type. I would recommend using this for data that is plotted along time on the x-axis.
I’ve learned how to code in Stata through various resources online. Particularly, Stats Exchange and Stata Forum.
This is a work in progress, and I will likely update this as I learn more efficient methods to produce 95% CI around average values.
The Github code for this R Markdown tutorial is located here.