readme.txt
by Jesse Rothstein
January 17, 2017

This file serves as documentation for the Stata programs used to implement the
analysis in:
    Rothstein, Jesse (2017). Revisiting the Impact of Teachers. January, 2017.

The programs use the "project.ado" command, written by Robert Picard, to
manage dependencies and ensure that all results are up-to-date. With all files
installed in the appropriate locations, this allows the entire project to be
executed via a single Stata command, "project cfr_nc, build".


---------------------------------------
DISTRIBUTION ARCHIVE CONTENTS
---------------------------------------

The "manifest.txt" file includes a full listing of the contents of the 
distribution archive (before programs are zipped into the "programarchive.zip"
file). 

The "programstructure.txt" file describes how the different programs relate
to each other, as well as the directory structure in which they are placed.
The distribution archive includes only the files listed as "original" and 
"relies_on"; other files, listed as "creates" or "uses", are created by these
programs and are not distributed here. 

Even among the first group of files, only programs and documentation files 
are included; the raw data files (those in the 
/accounts/projects/nceduc/data/nceduc/ directory tree) are restricted-use and 
must be obtained from the North Carolina Education Research Data Center at 
Duke University via a data use license. I am happy to assist readers interested
in obtaining them, but I do not control access.

The "validate.txt" file lists file names, change dates, and checksums for all
files. Note that checksums are not very robust in Stata. Users who obtain
their own copies of the raw data from NCERDC are likely to wind up with files
with different checksums. Even with identical data files, the checksums for
.log and .gph files will differ, because those files encode the exact date and
time at which they were created. Users hoping to use this file to assist in
replication should focus on the table output (.csv and .txt files) and the
.eps versions of the figures, for which checksums should be more robust.
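
As a rough illustration of such a comparison, the sketch below computes a
checksum for one output file with Stata's built-in "checksum" command, so that
it can be compared by eye with the entry in validate.txt. It assumes that the
values in validate.txt are comparable to those reported by "checksum"; the
file path is hypothetical and should be replaced with a real table or figure
from your own build.

```stata
* Hypothetical output file; substitute a real .csv, .txt, or .eps file.
local outfile "scratch/extension/table1.csv"

* The built-in -checksum- command leaves its results in r().
checksum "`outfile'"

* Print the values in fixed format so that no digits are rounded away.
display "file:     `outfile'"
display "checksum: " %12.0f r(checksum)
display "length:   " %15.0f r(filelen)
```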

The "programarchive.zip" file includes a zipped copy of all of the programs,
in the appropriate directory structure. All files listed in manifest.txt other
than the three above (manifest.txt, programstructure.txt, and validate.txt) are
included in this .zip file.


---------------------------------------
DOCUMENTATION OF PROGRAMS AND STRUCTURE
---------------------------------------

The master program is "cfr_nc.do", which executes the subsidiary programs.
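
For readers unfamiliar with "project", a master do-file typically hands each
subsidiary program to the "project, do()" directive, which reruns a program
only when it or one of its dependencies has changed. The excerpt below is a
hypothetical sketch of that pattern using file names from the list that
follows; the actual contents and ordering of cfr_nc.do may differ.

```stata
* Hypothetical excerpt of a master do-file managed by -project-.
* This is NOT a copy of cfr_nc.do; the selection and order are assumptions.
project, do("prepdata/prepdata_students.do")
project, do("prepdata/prepdata_SAR.do")
project, do("cfr/prepcfrsamp_1.do")
project, do("cfr/cfr_analysis_final_pt1.do")
```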

The top-level subsidiary programs are:

Group 1: Preparatory programs to read raw North Carolina data and get it into
  shape. These are written by Jesse Rothstein.
    1.1) prepdata/prepdata_students.do
    1.2) prepdata/prepdata_SAR.do
    1.3) prepdata/validteachers.do
    1.4) prepdata/prepdata_mbuild.do
    1.5) prepdata/prepdata_longrun.do

Group 2: Programs to create data files of the form, and with the variables,
  needed for the CFR analysis. These are adapted by Jesse Rothstein from
  the original file "create_schd_work_final.do" distributed as part of the 
  CFR replication archive.
    2.1) Make a basic student-level dataset with all controls
     cfr/prepcfrsamp_1.do
    2.2) Define sample (still at student level) and sample counts   
     cfr/prepcfrsamp_2.do  
    2.3) Create various aggregated (school-grade, teacher-year, etc.) samples
     cfr/prepcfrsamp_3.do
    2.4) Make file for long-run analysis (equivalent of CFR IRS files)
     cfr/prepcfrsamp_4.do

Group 3: Replication of CFR's analysis. These programs are adapted by
  Jesse Rothstein from the original file "analysis_final.do", distributed
  as part of the CFR replication archive. Line numbers below refer to the
  corresponding lines in that program.

  Note that I adapt much less of the second half of CFR's program, pertaining 
  to CFR-II: I do not have the same long-run outcome data, so the required
  changes would be quite dramatic. The "longrun" programs below conduct an
  exercise analogous to CFR's analysis of long-run outcomes.
  
  Many of these programs call or otherwise draw upon other programs also 
  distributed via the CFR replication archive. These programs are included
  in the cfr/cfrprograms directory. In a couple of cases, it was necessary to
  make minor edits to CFR's original programs. Examples are "vam.ado" and
  "prepare_quasi.ado". These edits are indicated by comments in those programs.
  All of the CFR subsidiary programs are included in each of the "prepcfrsamp"
  and "cfr_analysis_final" programs via "relies_on" commands, so that any edits
  to any of the subsidiary files will cause "project" to rerun all of the 
  group 2 and group 3 programs.
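
  The dependency declarations described above might look roughly like the
  following inside one of the cfr do-files. This is a sketch under stated
  assumptions, not an excerpt from the distributed programs: the data file
  names are invented, and paths are assumed to be relative to the do-file's
  own directory.

  ```stata
  * Hypothetical header for a do-file in the cfr directory.
  * Declaring the .ado files makes -project- rerun this program if they change.
  project, relies_on("cfrprograms/vam.ado")
  project, relies_on("cfrprograms/prepare_quasi.ado")

  * Declare an input created by an earlier program (invented file name).
  project, uses("../scratch/cfrdata/input_sample.dta")
  use "../scratch/cfrdata/input_sample.dta", clear

  * ... analysis steps would go here ...

  * Save the output first, then declare it (invented file name).
  save "../scratch/cfrdata/output_sample.dta", replace
  project, creates("../scratch/cfrdata/output_sample.dta")
  ```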
  
    3.1) Make teacher-class weights and school-grade-level means (CFR program 
         lines 95-355)
         cfr/cfr_analysis_final_pt1.do
    3.2) Construct value-added estimates and forecasts  (CFR program lines 
         361-513)
         cfr/cfr_analysis_final_pt2.do
    3.3) Report autocovariance of VA at various lags (CFR-A, Table 2) (CFR 
         program lines 517-599)
         cfr/cfr_analysis_final_pt3.do
    3.4) Create "cross-class" analysis sample (CFR program lines 601-812)
         cfr/cfr_analysis_final_pt4.do
    3.5) Summary statistics, observational regressions and bin scatter plots 
         (CFR program lines 815-1200) (CFR-A, Table 3 & Fig 2A)
         cfr/cfr_analysis_final_pt5.do
    3.6) Prepare and implement quasi-experimental analysis (CFR-A, Table 4, 
         Fig 4A-B, & Fig 5B) (CFR program lines 1203-1393)
         cfr/cfr_analysis_final_pt6.do
    3.7) Quasi-experimental analysis robustness (CFR-A, Table 5) (CFR program 
         lines 1396-1495)
         cfr/cfr_analysis_final_pt7A.do
    3.8) Quasi-experimental analysis robustness (CFR-A, Table 6) (CFR program 
         lines 1498-1721)
         cfr/cfr_analysis_final_pt7B.do
    3.9) Association between student characteristics and teacher VA (CFR-A, 
         appendix table 2) (CFR program lines 1724-1838)
         cfr/cfr_analysis_final_pt8.do
    3.10) Teacher switcher event studies, misc other (CFR-A, Figures 1A, 3, 
          A-1A) (CFR program lines 1841-2671)
          cfr/cfr_analysis_final_pt9.do
    3.11) Long-run "cross-class" analysis (CFR-B, Tables 1-2) (CFR program 
          lines 2671-2821)
          cfr/cfr_analysis_final_pt10.do
    3.12) Long-run "quasi-experimental" analysis (CFR-B, Table 5) (CFR 
          program lines 2987-3067)
          cfr/cfr_analysis_final_pt11.do

Group 4: Replication of CFR's analysis. These programs are by Jesse Rothstein, 
  and are adapted from CFR's programs. They are designed to reproduce the 
  primary CFR-I results in a more linear fashion than does the 
  "analysis_final.do" program (and its adaptation here as "cfr_analysis_final").
  
    4.1) replication/basicreplication.do
    4.2) replication/cfr1_table5.do
    4.3) replication/cfr1_fig1a.do
    4.4) replication/summstats.do

Group 5: These programs reproduce the CFR analysis and extend it in various 
  ways. They are not closely adapted from CFR's programs, for the most part,
  but implement the logic of the CFR analysis in my own code. Numerous checks
  ensure that results are the same as in the cfr_analysis_final analyses.

    5.1) extension/demogpredict.do
    5.2) extension/classlevel.do
    5.3) extension/imputations.do
    5.4) extension/preparequasi.do

    Analysis of different types of moves (between grades, between schools, 
    etc.). These are not needed for the results in the paper, except that 
    "movetypes.do" identifies followers used in apptab_leavenout.do, below.
    5.5) extension/movetypes.do
    5.6) extension/movetypes_quasi.do
    5.7) extension/movetypes_analysis.do

    5.8) Reproduce basic specifications
         extension/table1.do
    5.9) Falsification test
         extension/table2.do
    5.10) Quasiexperiment with controls
          extension/table3.do
    5.11) Sorting by demographics
          extension/demogsorting.do
    5.12) extension/vingtileplots.do
    
    Extra analyses, for appendix & extensions
    5.13) Robustness to excluding followers, leaving n out, etc.
          extension/apptab_leavenout.do
    5.14) extension/apptab_altimpute.do
   
Group 6: These programs conduct the analysis of long-run outcomes parallel to
  CFR-II. 
    6.1) longrun/makesamp_observational.do
    6.2) longrun/makesamp_quasi.do
    6.3) longrun/observationalregs.do
    6.4) longrun/quasiregs.do

Group 7: A few extras not included elsewhere.   
    7.1) replication/paperstatistics.do
    7.2) extension/cfrresponse_sims_t2.do


---------------------------------------
REPLICATION INSTRUCTIONS
---------------------------------------

Those hoping to replicate my analysis will need to follow several steps:
 1) Obtain a license for the NCERDC data, and convert the SAS files to Stata
    format. These files should be stored in the directory locations indicated
    in the "programstructure.txt" file (or the programs should be edited
    to point to the correct directory locations).
 2) Unzip the programarchive.zip file into a fresh directory that will be the
    project home.
 3) Create a "scratch" directory within the project home directory. (This can 
    be a pointer to a temporary drive, assuming that it has enough space -- 
    about 15GB in total.)
    Within this directory, create several subdirectories (here "home" denotes
    the project home directory):
      home/scratch/cfrdata
      home/scratch/cfrdata/eventstudy
      home/scratch/cfrdata/tfx
      home/scratch/cfrdata/tfx/table6
      home/scratch/extension
      home/scratch/longrun
      home/scratch/ncdata
      home/scratch/replication
 4) Unzip extras/project.zip and extras/estout.zip, and install the .ado
    files in your version of Stata (or simply install project and estout from
    the SSC archives).
 5) In a graphical user interface instance of Stata, type "project, setup".
    In the resulting dialog box, navigate to the "cfr_nc.do" program in the
    project home directory.
 6) Execute the project by typing "project cfr_nc, build" in Stata. If, as on
    my system, the default STATATMP directory does not have enough free space
    to hold all of the temporary files created by the project, you may need to
    set the environment variable "STATATMP" explicitly before starting Stata.
    Directions for this are in cfr_nc.do.
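
Steps 3 through 6 can be sketched in Stata as follows. This is a minimal
sketch, assuming that the current working directory is the project home, that
"project" and "estout" are installed from SSC rather than from the extras
directory, and that the commands are typed in a graphical Stata session.

```stata
* Step 3: create the scratch tree. Parents are listed before children
* because -mkdir- creates only one directory level at a time.
* -capture- lets the loop be rerun when a directory already exists.
local dirs scratch scratch/cfrdata scratch/cfrdata/eventstudy
local dirs `dirs' scratch/cfrdata/tfx scratch/cfrdata/tfx/table6
local dirs `dirs' scratch/extension scratch/longrun scratch/ncdata
local dirs `dirs' scratch/replication
foreach d of local dirs {
    capture mkdir "`d'"
}

* Step 4: install the two user-written packages from SSC.
ssc install project
ssc install estout

* Show where Stata writes temporary files, to confirm that STATATMP
* (if it was set before Stata was launched) took effect.
display c(tmpdir)

* Step 5: register the master do-file. This opens a dialog box;
* select cfr_nc.do in the project home directory.
project, setup

* Step 6: build the project.
project cfr_nc, build
```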