Introduction

Stata is a statistical software the is used to perform statistical analysis. It is one of my favorite statistical software, and I use it for many of my projects. Stata also has a powerful graphic editing tool that allows for a full range of customization. However, I like to keep all of these in the code so that I can have a convenient template to re-create my custom figures.

One figure that I generate often in Stata is the connected line plot, whch uses the connected command. Here is an example of what that looks like. Notice how each point is connected by a line. Each point represents the average price for each category of rep78. However, if I wanted to add the 95% confidence interval error bars, I will need to do more coding.

Objective

To add 95% CI to the Stata connected line plot, we will use the Medical Expenditure Panel Survey (MEPS) from the Agency for Healthcare Research and Quality (AHRQ). I have already downloaded and cleaned the data from this tutorial, which is available on my Dropbox folder. Make sure to download the data and save it on the folder where you are going to perform this exercise.

Motivating example

In this tutorial, we are interested in plotting the average total healthcare expenditures along with the 95% CI of responders from 2008 to 2019.

Step 1: Load the data into Stata

First, we will need to load the data into Stata. The data we will use for this tutorial is adjusted_combined_data_post.dta.

// Use the data from the following Dropbox folder:
* For Windows:
clear all 
cd "C:\Users\mbounthavong\Dropbox\Marks blog\Stata in R Markdown"
use adjusted_combined_data_post.dta
C:\Users\mbounthavong\Dropbox\Marks blog\Stata in R Markdown

Step 2: Visualize the data

Let’s look at the number of subjects we have across the years. There are twelve years of data from 2008 - 2019.

. tab year, m

       year |      Freq.     Percent        Cum.
------------+-----------------------------------
       2008 |     23,183        7.84        7.84
       2009 |     26,008        8.79       16.63
       2010 |     23,434        7.92       24.56
       2011 |     25,226        8.53       33.09
       2012 |     27,820        9.41       42.50
       2013 |     26,316        8.90       51.39
       2014 |     24,899        8.42       59.81
       2015 |     25,577        8.65       68.46
       2016 |     25,200        8.52       76.98
       2017 |     23,529        7.96       84.94
       2018 |     22,811        7.71       92.65
       2019 |     21,722        7.35      100.00
------------+-----------------------------------
      Total |    295,725      100.00

Step 3: Two-way line plot

We can create a two-way line plot using the connected command. We’ll plot he average total healthcare expenditure again the years (2008-2019).

We estimate the mean total healthcare expenditure using the egen command by year. Then we use the graph twoway connected code to generate the two-way line plot. Before you plot the figure, make sure that you sort by year. Otherwise, you will be a spaghetti-like plot where Stata will try to draw the line across the years if they are not sorted or in order.

. egen mean_expenditure = mean(totexp), by(year)

. sort year

. graph twoway connected mean_expenditure year, ///
>       title("Total healthcare expenditures (2008-2019)") ///
>       ytitle("Average total healthcare expenditure ($)") ///
>       xtitle("Time (year)") ///
>       xlab(2008(1)2019) ///
>       ylab(, labsize(small) nogrid) ///
>       graphregion(color(white)) ///
>       bgcolor(white)

. graph export "connected1.png", replace
file connected1.png saved as PNG format

Step 4: Estimate the 95% CI

So far, our plot only illustrates the average total healthcare expenditure across time. But we want to plot the average total healthcare expenditures with 95% confidence intervals (CI). The 95% CI is important because it will give us some visual indicator of the variance surrounding each value.

We will need to estimate the 95% CI for each value. We can get the 95% CI for each value by using the ci means command.

. egen mean_expenditure = mean(totexp), by(year)

. ci means mean_expenditure

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
mean_expen~e |    295,725    5757.172    1.967669        5753.315    5761.028

This gives us the 95% CI for the pooled total healthcare expenditure. If we want to get the 95% CI for each year, we need to specify the year. We use the forvalues loop to cycle through the years and estimate the mean, standard error, and 95% CI for the total healthcare expenditure at each year (2008-2019).

. forvalues i = 2008/2019 {
  2.         ci mean totexp if year == `i'
  3.   }

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,183    4777.358    80.79093        4619.002    4935.713

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     26,008    5343.664    89.85267        5167.548     5519.78

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,434     5043.94    91.67762        4864.246    5223.634

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,226    5029.757    140.6383        4754.098    5305.416

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     27,820    4688.955    89.59223         4513.35     4864.56

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     26,316    5007.205    89.70101        4831.386    5183.024

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     24,899     5477.92      101.46        5279.053    5676.788

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,577    5856.111    112.1399         5636.31    6075.911

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,200     5886.41    101.2013        5688.049     6084.77

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,529    6645.628    122.1442        6406.217    6885.038

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     22,811    7749.712    134.2076        7486.656    8012.768

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     21,722    8187.724    142.8591        7907.709    8467.738

Step 5: Generate the high and low values of the 95% CI

We start by adding two new variables: high and low, which are the high and low ends of the 95% CI. Then we estimate the 95% CI for the total healthcare expenditure at each year. The data are stored in Stata and can be retrieved using the r() syntax. For example, if I want the lower bound of the 95% CI in 2009, then I write replace low = r(lb) if year == 2009.

To make this process more efficient, we use the forvalues loop for each year (2008-2019).

. // Generate the upper and low bound of the 95% CI variable
. gen high = .
(295,725 missing values generated)

. gen low = .
(295,725 missing values generated)

. 
. // Loop commmand
. forvalues i = 2008/2019 {
  2.         ci mean totexp if year == `i'
  3.         replace high = r(ub) if year == `i'
  4.         replace low = r(lb) if year == `i'
  5. }

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,183    4777.358    80.79093        4619.002    4935.713
(23,183 real changes made)
(23,183 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     26,008    5343.664    89.85267        5167.548     5519.78
(26,008 real changes made)
(26,008 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,434     5043.94    91.67762        4864.246    5223.634
(23,434 real changes made)
(23,434 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,226    5029.757    140.6383        4754.098    5305.416
(25,226 real changes made)
(25,226 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     27,820    4688.955    89.59223         4513.35     4864.56
(27,820 real changes made)
(27,820 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     26,316    5007.205    89.70101        4831.386    5183.024
(26,316 real changes made)
(26,316 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     24,899     5477.92      101.46        5279.053    5676.788
(24,899 real changes made)
(24,899 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,577    5856.111    112.1399         5636.31    6075.911
(25,577 real changes made)
(25,577 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,200     5886.41    101.2013        5688.049     6084.77
(25,200 real changes made)
(25,200 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,529    6645.628    122.1442        6406.217    6885.038
(23,529 real changes made)
(23,529 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     22,811    7749.712    134.2076        7486.656    8012.768
(22,811 real changes made)
(22,811 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     21,722    8187.724    142.8591        7907.709    8467.738
(21,722 real changes made)
(21,722 real changes made)

Step 6: Plot the 95% CI

Now, that we have the 95% CI for total healthcare expenditure at each year, we can plot these using the rcap command. This allows us to create error bars using the variables we generated for the upper high and lower low bounds of the 95% CI.

// Generate the average total healthcare expenditure by year
egen mean_expenditure = mean(totexp), by(year)


// Generate the upper and low bound of the 95% CI variable
gen high = .
gen low = .

// Loop commmand
forvalues i = 2008/2019 {
    ci mean totexp if year == `i'
    replace high = r(ub) if year == `i'
    replace low = r(lb) if year == `i'
}

// Sort by year and generate the two-way line plot
sort year
graph twoway (rcap low high year) ///
             (connected mean_expenditure year, color(navy) msize(small) ///
                        title("Total healthcare expenditures (2008-2019)") ///
                        ytitle("Average total healthcare expenditure ($)") ///
                        xtitle("Time (year)") ///
                        xlab(2008(1)2019, labsize(vsmall)) ///
                        ylab(, labsize(vsmall) nogrid) ///
                        graphregion(color(white)) ///
                        bgcolor(white) ///
                        color(navy))
graph export "connected_errorbars.png", replace
(295,725 missing values generated)

(295,725 missing values generated)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,183    4777.358    80.79093        4619.002    4935.713
(23,183 real changes made)
(23,183 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     26,008    5343.664    89.85267        5167.548     5519.78
(26,008 real changes made)
(26,008 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,434     5043.94    91.67762        4864.246    5223.634
(23,434 real changes made)
(23,434 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,226    5029.757    140.6383        4754.098    5305.416
(25,226 real changes made)
(25,226 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     27,820    4688.955    89.59223         4513.35     4864.56
(27,820 real changes made)
(27,820 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     26,316    5007.205    89.70101        4831.386    5183.024
(26,316 real changes made)
(26,316 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     24,899     5477.92      101.46        5279.053    5676.788
(24,899 real changes made)
(24,899 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,577    5856.111    112.1399         5636.31    6075.911
(25,577 real changes made)
(25,577 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     25,200     5886.41    101.2013        5688.049     6084.77
(25,200 real changes made)
(25,200 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     23,529    6645.628    122.1442        6406.217    6885.038
(23,529 real changes made)
(23,529 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     22,811    7749.712    134.2076        7486.656    8012.768
(22,811 real changes made)
(22,811 real changes made)

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
      totexp |     21,722    8187.724    142.8591        7907.709    8467.738
(21,722 real changes made)
(21,722 real changes made)



file connected_errorbars.png saved as PNG format

Conclusions

Adding the 95% CI allows the reader to visualize the variance surrounding the mean total healthcare expenditure per year. This added dimension to the two-way line plot is useful for publication-quality images. However, these series of commands can be used for any continuous data type. I would recommend using this for data that is plotted along time on the x-axis.

Acknowledgement

I’ve learned how to code in Stata through various resources online. Particularly, Stats Exchange and Stata Forum.

This is a work in progress, and I will likely update this as I learn more efficient methods to produce 95% CI around average values.

The Github code for this R Markdown tutorial is located here.