readme.txt
by Jesse Rothstein
January 17, 2017

This file serves as documentation for the Stata programs used to implement
the analysis in:

   Rothstein, Jesse (2017). Revisiting the Impact of Teachers. January, 2017.

The programs are written to use the "project.ado" command, written by
Robert Picard, to manage dependencies and ensure that all results are
up-to-date. With all files installed in the appropriate locations, this
allows the entire project to be executed via a single Stata command,
   "project cfr_nc, build"

---------------------------------------
DISTRIBUTION ARCHIVE CONTENTS
---------------------------------------

The "manifest.txt" file includes a full listing of the contents of the
distribution archive (before programs are zipped into the
"programarchive.zip" file).

The "programstructure.txt" file describes how the different programs relate
to each other, as well as the directory structure in which they are placed.
The distribution archive includes only the files listed as "original" and
"relies_on"; other files, listed as "creates" or "uses" are created by these
programs and are not distributed here. Even among the first group of files,
only programs and documentation files are included; the raw data files
(those in the /accounts/projects/nceduc/data/nceduc/ directory tree) are
restricted-use and must be obtained from the North Carolina Education
Research Data Center at Duke University via a data use license. I am happy
to assist readers interested in obtaining them, but I do not control access.

The "validate.txt" file lists file names, change dates, and checksums for
all files. Note that checksums are not very robust in Stata -- users who
obtain their own copies of the raw data from NCERDC are likely to wind up
with files with different checksums, and even with the same data files the
checksums for .log and .gph files encode the exact date and time that the
file was created, so are not replicable.
Users hoping to use this file to assist in replication should focus on the
table output (.csv and .txt files) and the .eps versions of the figures, for
which checksums should be more robust.

The "programarchive.zip" file includes a zipped copy of all of the programs,
in the appropriate directory structure. All files listed in manifest.txt
other than the three above (manifest.txt, programstructure.txt, and
validate.txt) are included in this .zip file.

---------------------------------------
DOCUMENTATION OF PROGRAMS AND STRUCTURE
---------------------------------------

The master program is "cfr_nc.do." This executes the subsidiary programs.
The top-level subsidiary programs are:

Group 1: Preparatory programs to read raw North Carolina data and get it
into shape. These are written by Jesse Rothstein.
 1.1) prepdata/prepdata_students.do
 1.2) prepdata/prepdata_SAR.do
 1.3) prepdata/validteachers.do
 1.4) prepdata/prepdata_mbuild.do
 1.5) prepdata/prepdata_longrun.do

Group 2: Programs to create data files of the form, and with the variables,
needed for the CFR analysis. These are adapted by Jesse Rothstein, from the
original file "create_schd_work_final.do" distributed as part of the CFR
replication archive.
 2.1) Make a basic student-level dataset with all controls
      cfr/prepcfrsamp_1.do
 2.2) Define sample (still at student level) and sample counts
      cfr/prepcfrsamp_2.do
 2.3) Create various aggregated (school-grade, teacher-year, etc.) samples
      cfr/prepcfrsamp_3.do
 2.4) Make file for long-run analysis (equivalent of CFR IRS files)
      cfr/prepcfrsamp_4.do

Group 3: Replication of CFR's analysis. These programs are adapted by Jesse
Rothstein from the original file "analysis_final.do," distributed as part of
the CFR replication archive. Line numbers below refer to the corresponding
lines in that program. Note that I adapt much less of the second half of
CFR's program, pertaining to CFR-II, as I do not have the same long-run
outcome data so the changes would be quite dramatic.
The "longrun" programs below conduct an exercise analogous to CFR's analysis
of long-run outcomes.

Many of these programs call or otherwise draw upon other programs also
distributed via the CFR replication archive. These programs are included in
the cfr/cfrprograms directory. In a couple of cases, it was necessary to
make minor edits to CFR's original programs. Examples are "vam.ado" and
"prepare_quasi.ado." These edits are indicated by comments in those
programs. All of the CFR subsidiary programs are included in each of the
"prepcfrsamp" and "cfr_analysis_final" programs via "relies_on" commands, so
that any edits to any of the subsidiary files will cause "project" to rerun
all of the group 2 and group 3 programs.
 3.1) Make teacher-class weights and school-grade-level means
      (CFR program lines 95-355)
      cfr/cfr_analysis_final_pt1.do
 3.2) Construct value-added estimates and forecasts
      (CFR program lines 361-513)
      cfr/cfr_analysis_final_pt2.do
 3.3) Report autocovariance of VA at various lags (CFR-A, Table 2)
      (CFR program lines 517-599)
      cfr/cfr_analysis_final_pt3.do
 3.4) Create "cross-class" analysis sample
      (CFR program lines 601-812)
      cfr/cfr_analysis_final_pt4.do
 3.5) Summary statistics, observational regressions and bin scatter plots
      (CFR program lines 815-1200) (CFR-A, Table 3 & Fig 2A)
      cfr/cfr_analysis_final_pt5.do
 3.6) Prepare and implement quasi-experimental analysis
      (CFR-A, Table 4, Fig 4A-B, & Fig 5B)
      (CFR program lines 1203-1393)
      cfr/cfr_analysis_final_pt6.do
 3.7) Quasi-experimental analysis robustness (CFR-A, Table 5)
      (CFR program lines 1396-1495)
      cfr/cfr_analysis_final_pt7A.do
 3.8) Quasi-experimental analysis robustness (CFR-A, Table 6)
      (CFR program lines 1498-1721)
      cfr/cfr_analysis_final_pt7B.do
 3.9) Association between student characteristics and teacher VA
      (CFR-A, appendix table 2)
      (CFR program lines 1724-1838)
      cfr/cfr_analysis_final_pt8.do
 3.10) Teacher switcher event studies, misc other
      (CFR-A, Figures 1A, 3, A-1A)
      (CFR program lines 1841-2671)
      cfr/cfr_analysis_final_pt9.do
 3.11) Long-run "cross-class" analysis (CFR-B, Tables 1-2)
      (CFR program lines 2671-2821)
      cfr/cfr_analysis_final_pt10.do
 3.12) Long-run "quasi-experimental" analysis (CFR-B, Table 5)
      (CFR program lines 2987-3067)
      cfr/cfr_analysis_final_pt11.do

Group 4: Replication of CFR's analysis. These programs are by Jesse
Rothstein, and are adapted from CFR's programs. They are designed to
reproduce the primary CFR-I results in a more linear fashion than does the
"analysis_final.do" program (and its adaptation here as
"cfr_analysis_final").
 4.1) replication/basicreplication.do
 4.2) replication/cfr1_table5.do
 4.3) replication/cfr1_fig1a.do
 4.4) replication/summstats.do

Group 5: These programs reproduce the CFR analysis and extend it in various
ways. They are not closely adapted from CFR's programs, for the most part,
but implement the logic of the CFR analysis in my own code. Numerous checks
ensure that results are the same as in the cfr_analysis_final analyses.
 5.1) extension/demogpredict.do
 5.2) extension/classlevel.do
 5.3) extension/imputations.do
 5.4) extension/preparequasi.do

 Analysis of different types of moves (between grades, between schools,
 etc.). These are not needed for the results in the paper, except that
 "movetypes.do" identifies followers used in apptab_leavenout.do, below.
 5.5) extension/movetypes.do
 5.6) extension/movetypes_quasi.do
 5.7) extension/movetypes_analysis.do

 5.8) Reproduce basic specifications
      extension/table1.do
 5.9) Falsification test
      extension/table2.do
 5.10) Quasiexperiment with controls
      extension/table3.do
 5.11) Sorting by demographics
      extension/demogsorting.do
 5.12) extension/vingtileplots.do

 Extra analyses, for appendix & extensions
 5.13) Robustness to excluding followers, leaving n out, etc.
      extension/apptab_leavenout.do
 5.15) extension/apptab_altimpute.do

Group 6: These programs conduct the analysis of long-run outcomes parallel
to CFR-II.
 6.1) longrun/makesamp_observational.do
 6.2) longrun/makesamp_quasi.do
 6.3) longrun/observationalregs.do
 6.4) longrun/quasiregs.do

Group 7: A few extras not included elsewhere.
 7.1) replication/paperstatistics.do
 7.2) extension/cfrresponse_sims_t2.do

---------------------------------------
REPLICATION INSTRUCTIONS
---------------------------------------

Those hoping to replicate my analysis will need to follow several steps:

1) Obtain a license for the NCERDC data, and convert the SAS files to Stata
   format. These files should be stored in directory locations indicated in
   the "programstructure.txt" file (or the programs should be edited to
   point to the correct directory locations).

2) Unzip the programarchive.zip file into a fresh directory that will be the
   project home.

3) Create a "scratch" directory within the project home directory. (This can
   be a pointer to a temporary drive, assuming that it has enough space --
   about 15GB in total.) Within this directory, create several
   subdirectories:
      home/scratch/cfrdata
      home/scratch/cfrdata/eventstudy
      home/scratch/cfrdata/tfx
      home/scratch/cfrdata/tfx/table6
      home/scratch/extension
      home/scratch/longrun
      home/scratch/ncdata
      home/scratch/replication

4) Unzip the extras/project.zip and extras/estout.zip, and install the .ado
   files in your version of Stata (or simply install project and estout from
   the SSC archives).

5) In a graphical user interface instance of Stata, type "project, setup".
   In the resulting dialog box, navigate to the "cfr_nc.do" program in the
   project home directory.

6) Execute the project by typing "project cfr_nc, build" in Stata. (You may
   need to explicitly set the environment variable "STATATMP" before
   starting Stata if, on your system (like mine), the default STATATMP
   directory does not have enough free space to hold all of the temporary
   files created by the project. Directions for this are in cfr_nc.do.)
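
The scratch subdirectories listed in step 3 can be created by hand, but a
short script avoids typos. The following is only a sketch and is not part of
the distribution archive: it is written in Python rather than Stata, the
function name make_scratch_tree is made up for illustration, and it assumes
that "home" in step 3 means the project home directory chosen in step 2.
Stata's own "mkdir" command would serve equally well.

```python
from pathlib import Path

# Subdirectories from step 3, written relative to the project home directory.
SCRATCH_SUBDIRS = [
    "scratch/cfrdata",
    "scratch/cfrdata/eventstudy",
    "scratch/cfrdata/tfx",
    "scratch/cfrdata/tfx/table6",
    "scratch/extension",
    "scratch/longrun",
    "scratch/ncdata",
    "scratch/replication",
]


def make_scratch_tree(project_home):
    """Create the step-3 scratch tree under project_home and return the paths.

    make_scratch_tree is a hypothetical helper name, not something shipped
    with the replication archive.
    """
    home = Path(project_home)
    created = []
    for sub in SCRATCH_SUBDIRS:
        path = home / sub
        # parents=True also creates scratch/ itself if it is missing;
        # exist_ok=True makes it safe to run the script more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_scratch_tree(".") from the project home directory creates the
tree in place. If "scratch" is instead a pointer to a temporary drive, set
up that pointer first so the subdirectories land on the larger drive.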