Chapter 8 Stata replication code of Hellerstein (1998)

The following Stata program replicates the empirical analyses of Hellerstein, Judith K. 1998. “The Importance of the Physician in the Generic versus Trade-Name Prescription Decision,” The RAND Journal of Economics 29(1):108–36, doi: 10.2307/2555818.

This file contains program code for data preparation, replication of descriptive tables and figures, and replication of the empirical analyses and tables. The structure of this file is as follows:

Preparing the NAMCS and NAMCSd data (Section 4)
Descriptive statistics using NAMCS and NAMCSd data (Section 5)
Empirical analyses using NAMCSd data (Section 5 and Section 6)

8.1 Preparing the analysis data set

* Pathing and setup
log close _all
snapshot erase _all
clear all

*storage of NAMCsd raw data
global r                `"  "C:\Users\my name\my project\raw data\" "'

*analysis data
global d                `"  "C:\Users\my name\my project\data\" "'


*results
global tab          `"  "C:\Users\my name\my project\tables\" "'
global fig      `"  "C:\Users\my name\my project\figures\" "'


*Data were downloaded from https://ftp.cdc.gov/pub/Health_Statistics/NCHS/namcs_public_use_files/*
*1991 datasets and documentation: namcs91.exe; namcs91d.exe; namcs91doc.pdf*

*   Preparing NAMCS91:
******************************
use "$r\namcs91_raw.dta" , clear

* 1) Set up relevant dummies
gen female = sex == 1
gen nonwhite = race != 1
gen hispanic = ethnicity == 1
gen northeast = geo_reg == 1
gen midwest = geo_reg == 2
gen south = geo_reg == 3
gen west = geo_reg == 4
//Hellerstein defines specialists as 
//physicians who are not in general practice, family practice, or basic pediatrics.
gen specialist = (phys_special != "GP" & phys_special != "FP" & phys_special != "PD") 
//Medicare patients 
gen temp = selfpay + medicaid + other_gov_ins + private_ins + hmo_pre_paid
replace medicare = 0 if temp != 0  


* 2) Drop observations with unknown payment
gen temp1 = selfpay + medicaid  + medicare + other_gov_ins + private_ins + hmo_pre_paid
keep if temp1 == 1

compress
save "$d\namcs91.dta", replace




*   Preparing NAMCS91d:
****************************
use "$r\namcs91d_raw.dta" , clear


* 1) Removing observations with missing values:
***********************************************
replace generic_id = .b if generic_id == 50000
replace generic_status = .b if generic_status == 3
replace prescription_status = .b if prescription_status == 3
replace composition_status = .b if composition_status == 3 | composition_status == 6

drop if missing(generic_id)|missing(generic_status)|///
  missing(prescription_status)|missing(composition_status)

 
* 2) Creating required variables:
**********************************
** 2.1) Define 8 largest drug classes, see Table 3 and page 79 in Hellerstein's paper
gen major_drug_class = ///
    drug_class == 3 | drug_class == 5 | drug_class == 6 | drug_class == 10 | ///
    drug_class == 12 | drug_class == 15 | drug_class == 17 | drug_class ==  19


** 2.2) Creating some dummies
gen female = sex == 1
gen nonwhite = race != 1
gen hispanic = ethnicity == 1
gen northeast = geo_reg == 1
gen midwest = geo_reg == 2
gen south = geo_reg == 3
gen west = geo_reg == 4

gen specialist = (phys_special != "GP" & phys_special != "FP" & phys_special != "PD") 
gen temp = selfpay + medicaid + private_ins + hmo_pre_paid
replace medicare = 0 if temp != 0

replace generic_status = 0 if generic_status == 2
replace prescription_status = 0 if prescription_status == 2

tab drug_class, gen (drug_class_)
label variable drug_class_3 "Antimicrobial"
label variable drug_class_5 "Cardiovascular-renals"
label variable drug_class_6 "Central Nervous System"
label variable drug_class_10 "Hormones"
label variable drug_class_12 "Skin/Mucous Membranes"
label variable drug_class_15 "Ophthalmics"
label variable drug_class_17 "Pain Relief"
label variable drug_class_19 "Respiratory Test"


** 2.3) Create multisource indicator
*** 2.3.1) Create id for drugs with multiple ingrediants
replace ingredients = generic_id if ingredients == .

*** 2.3.2) Check whether there are both generic and trade-name drugs for each ingredients
bys ingredients: egen counter = mean(generic_status) 
bys ingredients: gen multisource = (counter >0 & counter <1)
drop counter


** 2.4) Set up physician averages
foreach x in age female nonwhite hispanic medicare medicaid hmo_pre_paid private_ins{
    cap: drop mean_`x'
    bys physician_id: egen mean_`x' = mean(`x')
}


* 3) Keeping relevant variables
****************************
//Dropping conditions
keep if prescription_status == 1 & multisource == 1 & major_drug_class == 1 & order_of_drug == 1  
* (selfpay == 1 | medicaid == 1 | private_ins == 1 | hmo_pre_paid == 1 | medicare == 1)
gen temp1 = selfpay + medicaid + medicare + private_ins + hmo_pre_paid
keep if temp1 == 1

keep ///
drug_id drug_name generic_id generic_name generic_status drug_class drug_class_* ingredients ///
age female nonwhite hispanic west northeast midwest south geo_reg specialist mean_* /// 
selfpay medicare medicaid private_ins hmo_pre_paid   ///
geo_reg phys_special phys_type physician_id patient_id

compress
save "$d\namcs91d.dta", replace

end of do-file

8.2 Descriptive statistics


*   Table 1:
****************

/* Comment:
Information was obtained from the footnote of Table 1, p. 115 of the original study.
The text stats that only uses data without missing values.
This applies to all tables and figures and was done in the initial data preparation.

The footnote defines the dummy variable "specialist":
whether a physician is not a general practice, family practice, or basic pediatrics.
The dummy medicare takes the value of 1 if medicare was the only source of payment.

Observations of patient visits in which the patient was not charged were excluded.
*/
  
  
use "$d\namcs91.dta", clear

qui eststo table1: estpost summarize age female nonwhite hispanic selfpay medicare medicaid ///
private_ins other_gov_ins  hmo_pre_paid specialist northeast midwest south west

esttab table1 using "$tab\Table_1.rtf", replace  ///
title("Summary Statistics for Overall NAMCS Patient Sample") ///
cells ("mean(fmt(%12.2f)) sd(fmt(%12.2f))") nonumber noobs ///
coeflabels(age "Age" female "Female" nonwhite "Nonwhite" hispanic "Hispanic" ///
selfpay "Self-pay" medicare "Medicare" medicaid "Medicaid" private_ins "Private/commercial" ///
other_gov_ins "Other government insurance" hmo_pre_paid "HMO/prepaid plan" ///
  specialist "Specialist" ///
northeast "Northeast" midwest "Midwest" south "South" west "West" ) ///
addnote("Notes: Sample size is 32,407. For further notes see table 1 notes in Hellerstein (1998); ///
        Sample size differs as data from different years is used" "Data source: NAMSC91")




*   Table 2:
****************

/* Comment:
Variables and data preparing processes equivalent to Table 1.

Create a matrix resembling the table and store each value in its respective cell.
*/
use "$d\namcs91d.dta", clear


* 1) Saving all values in locals:
**********************************
* For all rows besides the last one
foreach x in age female nonwhite hispanic selfpay medicare medicaid  private_ins hmo_pre_paid specialist northeast midwest south west {
    
    * Computing mean and sd
    qui: sum `x'
    local `x'_m = r(mean)
    local `x'_std = r(sd)
    
    * Computing the share of generics for all rows exept for the row age
    if "`x'" != "age"{
        qui: sum  generic_status if `x' == 1
        local `x'_share = r(mean)
    }
    }
    
* Last row (Share of generics in the full sample)
qui: sum generic_status
local fullsample_share = r(mean)
    

* 2) Filling a matrix with the locals:
***************************************
mat input table2 = ( /// 
`age_m', `age_std' , . \ ///
`female_m', `female_std', `female_share' \ ///
`nonwhite_m', `nonwhite_std', `nonwhite_share' \ ///
`hispanic_m', `hispanic_std', `hispanic_share' \ ///
`selfpay_m', `selfpay_std', `selfpay_share' \ ///
`medicare_m', `medicare_std', `medicare_share' \ ///
`medicaid_m', `medicaid_std', `medicaid_share' \ ///
`private_ins_m', `private_ins_std', `private_ins_share' \ ///
`hmo_pre_paid_m', `hmo_pre_paid_std', `hmo_pre_paid_share' \ ///
`specialist_m', `specialist_std', `specialist_share' \ ///
`northeast_m', `northeast_std', `northeast_share' \ ///
`midwest_m', `midwest_std', `midwest_share' \ ///
`south_m', `south_std', `south_share' \ ///
`west_m', `west_std', `west_share' \ ///
. , . , `fullsample_share' ///
 )

 
* 3) Labelling and exporting:
**************************************
matrix rownames table2 = ///
"Age" "Female" "Nonwhite" "Hispanic" "Self-Pay" "Medicare" "Medicaid" "Private/Commercial" ///
"HMO/prepaid plan" ///
"Specialist" "Northeast" "Midwest" "South" "West" "Full sample"

matrix colnames table2 = "Mean" "Standard Deviation" "Proportion Generic"

putexcel set "$tab\Table_2.xlsx", replace
putexcel A1 = matrix(table2 ), names


*   Table 3:
****************
use "$d\namcs91d.dta", clear
return clear
ereturn clear

* Turn decimal numbers into percentage
gen All_drugs = generic_status * 100

* generic share for all drugs
eststo t10: quietly estpost tabstat All_drugs, ///
statistics(mean N) columns(statistics) listwise

* generic share by drug class
eststo t20: quietly estpost tabstat All_drugs, by(drug_class) ///
statistics(mean N) columns(statistics) listwise notot

esttab t10 t20 using "$tab\Table_3.rtf", replace ///
cells("count(fmt(%12.0f)) mean(fmt(%12.2f))") ///
title("Table 3 - Frequency of Generic Prescription by Drug Class") ///
collabels("Observations" "% Generics") noobs nonumber gaps refcat(3 "By drug class" , nolabel) ///
coeflabels(generic "All drugs" 3 "Antimicrobials" 5 "Cardiovascular-renals" ///
 6 "Central Nervous System" 10 "Hormones/Hormonal mechanisms" ///
 12 "Skin/Mucous membrane" 15 "Ophthalmics" 17 "Pain relief" 19 "Resperatory tract") 




*   Figure 1:
*****************
/* Comment:
Visualizations of kernel densities vary depending on the specifications. 
Two parameters define the density function: Bandwidth and whether/how to round the data.
Hellerstein does not provide information about the density function.
We  visually rule out that she rounds to the first decimal place.
There are fluctuations between the first decimal places.
Rounding to the third or higher decimal places produces much noisier curves.
Hence, we assume that she rounded to the second decimal place.
For the bandwidth and density function, there are no obvious clues.
That the reproduction relies on visual approximation of the original figure.
*/


* 1) Estimate the mean share of prescribed generics per physician
bys physician_id: egen temp2 = mean(generic_status)
gen generic_share = round(temp2, .01) 


* 2) Collapse the data so that one observation per physician remains
duplicates drop physician_id, force


* 3) Recreate Figure 1
set scheme  s1manual
twoway kdensity generic_share, range(0 1) bw( 0.02) kernel(bi)  ///
xtitle("Physician generic prescription rates") ytitle("Percent of physicians") ///
xlabel(0 (0.1) 1) lc(black)

graph export "$fig\Figure_1.png", replace



*   Figure 2:
*****************
use "$d\namcs91d.dta", clear

/* Comment:
The figure uses observations of drugs with > 6 prescriptions by a single physician. 
158 unique physicians remain in Hellerstein (1998).
We are left with 122 physicians (see p. 116 first paragraph).
*/


* 1) Apply dropping conditions
bys physician_id ingredient : gen temp = _N
drop if temp < 6
* Check the number of remaining physicians. 158 in Hellerstein
qui: tab physician_id
di r(r)


* 2) Create the categories for the bars
** 2.1) Avg generic prescription rate for each per physician
bys physician_id ingredients: egen temp2 = mean(generic_status) 

 * 2.2) Use the share to set up the groups
gen counter = .
replace counter = 0 if temp2 == 0 /* Never generic */
replace counter = 1 if temp2 > 0 & temp2 < 1 /* Sometimes */
replace counter = 2 if temp2 == 1 /* Always */

label define l 0 "Only trade names" 1 "Both versions" 2 "Only generics"
label values counter l


* 3) Collapse the data so that one observation per physician remains
duplicates drop physician_id, force


* 4) Recreating the figure
set scheme  s1manual
graph bar (count), over(counter) ///
ytitle("") intensity(60) lintensity(100) bar(1, color("black") fcolor("gs5")) ylab(, nogrid)

graph export "$fig\Figure_2.png", replace

invalid 'C' 
r(198);

end of do-file
r(198);

8.3 Empirical analyses and tables


* Setting up the regression input, equivalent to the notation in the paper

use "$d\namcs91d.dta", clear
xtset physician_id



gen G = generic_status

/* drug classes */
global C drug_class_3 drug_class_5 drug_class_6 drug_class_10 drug_class_12 drug_class_15 drug_class_19     

/* patient demographics */
global X age female hispanic nonwhite                                       

/* insurance */
global P medicare medicaid hmo_pre_paid private_ins                             

/* is phys a specialist */
global S specialist                                             

/* region */
global R midwest south west                                         

/* patient avgs */
global X_dash mean_age mean_female mean_nonwhite mean_hispanic                          

/* insurance avgs */
global P_dash mean_medicaid mean_medicare mean_private_ins mean_hmo_pre_paid                    

* M and T are not included due to missing state id




*   Table 4:
****************
return clear
ereturn clear
eststo clear

eststo table4: xtprobit G $C $X $P $S $X_dash $P_dash $R, re
eststo marginstable4: margins, dydx( $X $S $X_dash $P_dash $R) post

*Reporting Table 4
esttab table4 marginstable4 using "$tab\Table-4.rtf", replace wide ///
keep(_cons $X $S $X_dash $P_dash $R) stats(rho) noobs b(3) order(_cons) ///
coeflabels(_cons "Constant" age "Age" female "Female" nonwhite "Nonwhite" hispanic "Hispanic" ///
              specialist "Specialist" mean_age "Mean age" /// 
mean_female "Percent female" mean_nonwhite "Percent black" ///
  mean_hispanic "Percent Hispanic" mean_medicaid "Percent Medicaid" mean_medicare ///
"Percent Medicare" mean_private_ins "Percent private insured" ///
  mean_hmo_pre_paid "Percent HMO/prepaid"  northeast "Northeast" midwest ///
"Midwest" south "South" west "West") ///
  mtitles("Random-Effects Probit Coefficient" "% Change in Generic") nonumber addnote("Datasource: NAMSC91") 




*   Table 5:
****************
return clear
ereturn clear
eststo clear

eststo table5:   xtprobit G $C $X $P $S $R $X_dash $P_dash, re
eststo marginstable5: margins, dydx($C) post

*Reporting Table 5
esttab table5 marginstable5 using "$tab\Table-5.rtf", replace noobs wide keep($C) nonumber ///
coeflabels(drug_class_1 "Antimicrobials" drug_class_2 "Cardiovascular-renals"///
             drug_class_3 "Central Nervous System" drug_class_4 ///
"Hormones/Hormonal mechanisms" drug_class_5 "Skin/Mucous membrane" drug_class_6 "Ophthalmics" drug_class_8 "Resperatory tract") ///
addnote("Notes: The ommitted drug category is pain relief" "Datasource: NAMSC91") mtitles("Random-Effects Probit Coefficient" "% Change in Generic") 


*   Table 6:
****************
return clear
ereturn clear
eststo clear

foreach x in 3 5 6 10 12 15 17 19 {

* Estimating and saving the coefs
eststo c_`x': quietly xtprobit G $X $P $S $R $X_dash $P_dash  if drug_class_`x' == 1, re
* Estimating and saving the marginal effects (AME, see table footer)
eststo m_`x': qui margins, dydx($P) post
}


esttab c_3 m_3 c_5 m_5 c_6 m_6 c_10 m_10 c_12 m_12 c_15 m_15 c_17 m_17 c_19 m_19 using "$tab\Table6.rtf", replace ///
nostar nopar  keep($P)  noobs b(2)  order(medicaid medicare private_ins hmo_pre_paid) ///
refcat(medicare "Medicare" medicaid "Medicaid" hmo_pre_paid "HMO/Prepaid" private_ins "Private", nolabel) ///
coeflabels(medicare "Coefficient" medicaid "Coefficient" hmo_pre_paid "Coefficient" private_ins "Coefficient") ///
mtitle("Antimicrobial" "% change" "Cardiovasculars" "% change" "Central Nervous System" "% change" "Hormones" "% change" "Skin/Muchous Membranes" "% change" "Ophthalmics" ///
"% change" "Pain relief" "% change" "Respiratory Tract" "% change" ) nonumbers eqlabels(none) collabels(none) nonumbers compress


* Tables 7, 8 and 9 are not reproducible due to missing state id in the publicly available datasets

invalid 'C' 
r(198);

end of do-file
r(198);