/*****************************************************************************/
/**********   Tips and Tricks Guide for Rent-sharing SAS coding   ************/
/*****************************************************************************/


/*****************************************************************************/
/********************   What data files are important   **********************/
/*****************************************************************************/

    *home (/bod/soi/t3gqb/patents/build/sas/):

        * quartile_variables_sorted.sas7bdat
          This is the main data file that we used with worker information in
          the final analysis sample. Also next to each variable is the original
          data set that the variable came from (see the section immediately
          following this one that discusses how to recreate
          quartile_variables_sorted for more information on those data sets).

            * tin (patent_eins_workers_bp_full)
                The tax identifcation number of the worker for this observation

            * tax_yr (patent_eins_workers_bp_full)
                The year of the associated observation, runs from 1996-2014 in
                the data.

            * payer_tin_w2max (patent_eins_workers_bp_full)
                The tax ID number of the firm at which the given employee
                worked in the associated tax year. Note that this variable is
                ``masked'', so it corresponds to the firm TINs in the larger
                ``databank_masked'' data set from which most of the data in
                quartile_variables_sorted is derived.

            * payer_tinXXXX (patent_eins_workers_bp_full)
                The firm TIN where thre associated worker for the observation
                worked in year XXXX. Note that given the construction of the
                data, the data set is effectively long and wide in the firm
                TIN. That is, you can access a worker's 2000 employer by
                looking at the payer_tin2000 variable in any year's observation
                or you could get that 2000 employer by looking at the
                payer_tin_w2_max value in the 2000 tax_yr observation.

            * wagesXXXX (patent_eins_workers_bp_full)
                Like payer_tinXXXX described above, this store the wages for
                the worker in year XXXX. Unlike the firms, there is currently
                no parallel to the payer_tin_w2_max variable which stores the
                associated year's wages on demand, so we construct that on each
                run.

            * gnd_ind (gender)
                The associated worker's listed gender. It's a character and one
                of 'M' (for male), 'F' (for female), or 'U' (for undefined).

            * ssa_yob (databank_masked)
                The year of birth for the person of interest

            * tent_tx (databank_masked)
                The tentative expected tax payment for the given worker in
                the given year.

            * inventor (unique_inventor)
                A variable storing whether the person is classified as an
                inventor. This takes on two values: 1 for an inventor; and
                missing for a non-inventor.

            * p25_v (processed from patent_eins_workers_bp_full data)
                The firm-year 25th wage percentile, missing if that firm had
                fewer than 4 employees that year.

            * p50_v (processed from patent_eins_workers_bp_full data)
                The firm-year 50th wage percentile, missing if that firm had
                fewer than 4 employees that year.

            * p75_v (processed from patent_eins_workers_bp_full data)
                The firm-year 75th wage percentile, missing if that firm had
                fewer than 4 employees that year.


    * decyr (/bod/soi/gzhrb/patents/):

        * tin_appnum_appyear_decyear.csv
          (this CSV file stores a link between firm and patent application
          numbers, application years, and decision years)

            * tin:
                The tax ID number of the firm, this variable is the same as
                quartile_variables_sorted's ``payer_tin_w2max'', just under a
                different name. It is NOT supposed to be merged with
                quartile_variables_sorted's tin variable, that tin variable
                representing workers (not firms).

            * appnum:
                The application number of the patent that the associated firm
                is applied for in our data.

            * applicationyear:
                The application year of the associated patent application in
                appnum.

            * decisionyear:
                The year of the first patent office decision (allowance/
                rejection) for the associated patent application in appnum

    * databank datasets:

        * databank_masked.sas7bdat
            Used to obtain spousal tins for an occupation coding project,
            or a handful of new variables like year of birth, for example.

        * mtrdb_f1040.sas7bdat
          mtrdb_f1040a.sas7bdat
          mtrdb_f1040ez.sas7bdat
            Used three individual tax filing forms to obtain occupations
            string names for the mean_occ_wages_v2.sas code, and the
            occ_pct_greedy_v6.sas code. We ended up not using occupations.



/*****************************************************************************/
/*****************   A note on quartile_variables_sorted   *******************/
/*****************************************************************************/


To facilitate your ability to replicate the data set moving forward, we provide
a continuous block of code below to recreate the quartile_variables_sorted data
set's contents. It requires the following data sets:

    * patent_eins_workers_bp_full.sas7bdat (`home' dir)
        This data set was created by Neviana Petkova, and contains the tin;
        tax_yr; payer_tin_w2_max; and yearly wage and payer TIN variables.

    * gender.sas7bdat (`home' dir)
        This data set contains the gender of the individual, and their worker
        TIN. Note that gender goes not vary at the year level, as a result.

    * unique_inventor.sas7bdat (`home' dir)
        This data set contains the inventor status of the individual, and their
        worker TIN. 

    * databank_masked.sas7bdat (`cdw' dir)
        From the `CDW'directory, this
        data set was use to fill in a bit of information on the age and expected taxes of an individual.


/***********   Begin code to construct quartile_variables_sorted   ***********/

libname home '/user/patents/build/sas/';

* get percentile cut-offs for quartile variables;
proc sql;
  create table wagedata as
  select payer_tin_w2_max, tax_yr,
         case tax_yr
           when 1996 then wages1996
           when 1997 then wages1997
           when 1998 then wages1998
           when 1999 then wages1999
           when 2000 then wages2000
           when 2001 then wages2001
           when 2002 then wages2002
           when 2003 then wages2003
           when 2004 then wages2004
           when 2005 then wages2005
           when 2006 then wages2006
           when 2007 then wages2007
           when 2008 then wages2008
           when 2009 then wages2009
           when 2010 then wages2010
           when 2011 then wages2011
           when 2012 then wages2012
           when 2013 then wages2013
                     else wages2014
         end as wages
  from home.patent_eins_workers_bp_full;
quit;

proc means data=wagedata noprint nway;
    class payer_tin_w2_max tax_yr;
    var wages;
    output out=quarts(keep=payer_tin_w2_max tax_yr _FREQ_ p25_v p50_v p75_v)
        p25=p25_v p50=p50_v p75=p75_v;
run;

data quarts(drop=_FREQ_);
  set quarts;
    if _FREQ_<4 then do;
      p25_v = .;
      p50_v = .;
      p75_v = .;
    end;
run;



*import extra databank vars data (make sure it is unique at tin-tax_yr level);
proc sort data=cdw.databank_masked(keep=tin tax_yr adjgross
                                        ssa_yob attend_college tent_tx)
          out=databank;
  by tin tax_yr descending adjgross;
run;

proc sort data=databank(keep=tin tax_yr ssa_yob attend_college tent_tx)
          out=databank nodupkey;
  by tin tax_yr;
run;



* merge all of the data together, then sort at the proper level;
proc sql;
  create table home.quartile_variables_sorted as
  select
         a.tin, a.tax_yr, a.payer_tin_w2_max,
         a.payer_tin1996-a.payer_tin2014,
         a.wages1996-a.wages2014,

         b.p25_v, b.p50_v, b.p75_v,

         c.gnd_ind,

         d.inventor,

         e.ssa_yob, e.attend_college, e.tent_tx

  from home.patent_eins_workers_bp_full as a

    left join quarts as b
      on (a.payer_tin_w2_max=b.payer_tin_w2_max and a.tax_yr=b.tax_yr)

    left join home.gender as c
      on (a.tin=c.tin)

    left join home.unique_inventor as d
      on (a.tin=d.tin)

    left join databank as e
      on (a.tin=e.tin and a.tax_yr=e.tax_yr)

  order by a.tin, a.tax_yr;

quit;

/************   End code to construct quartile_variables_sorted   ************/



/*****************************************************************************/
/*************************   Where my code lives   ***************************/
/*****************************************************************************/

Much of the SAS coding has been to take worker-level data (from 
quartile_variables_sorted) and convert it into some kind of firm-by-year
level aggregate data. 


Below each code files is a list of input and output files. 
The output CSV files are described in more detail in
the associated documentation files (i.e., the `.docx' files in the /docs folder). 
Below several entries are numbered, others having leading
asterisks. The order presented below is the order in which the code files were
run, and the numbered entries are ones that end up in the `make_kpwz.do' 
code file, while the asterisked ones current do not end up in the final run.
The order in which the code files were run is as
follows:

    1) quartile_wage_measures_20170428.sas
        input:
            - home.quartile_variables_sorted.sas7bdat
        output:
            - mean_quartile_wages.csv
            - mean_quartile_stayerwages.csv
            - quartile_sep_rate_cht.csv
            - quartile_sep_rate_entrant.csv


    2) sep_rates_w_pooled_20180509.sas
        input:
            - home.quartile_variables_sorted.sas7bdat
        output:
            - separators_pooled.csv
            - quartile_sep_rate.csv


    3) gnd_inv_styr_ent3_20170706.sas
        input:
            - home.patent_eins_workers_bp_full.sas7bdat
            - home.unique_inventor.sas7bdat
            - home.gender.sas7bdat
        output:
            - quartile_gender_inventor.csv
            - gender_inventor.csv
            - mean_quartile_stayer_currwages.csv
            - mean_quartile_entrant3s.csv
            - mean_entrant3s.csv


    4) cht_quartile_wages_and_ent_wage_growth_20170816.sas
        input:
            - home.quartile_variables_sorted.sas7bdat
        output:
            - cohort_appyr_q_wages.csv
            - entrant_wage_growth.csv


    5) cohort_wagegrowth_age_college_tenttax_20170822.sas
        input:
            - home.quartile_variables_sorted_XXXX.sas7bdat
                                 ( XXXX \in [1996, 2014] )
        output:
            - cohort_over40_wages.csv
            - cohort_under40_wages.csv
            - cohort_male_wages.csv
            - cohort_female_wages.csv
            - cohort_inventor_wages.csv
            - cohort_noninventor_wages.csv

    6) pred_qual_sep_wages.sas
        input:
            - home.quartile_variables_sorted.sas7bdat
        output:
            - med_expanded_qual_sep_vars.csv
            - med_baseline_qual_sep_vars.csv

    7) inv_wages_v2.sas
        input:
            - inv.app_inventor.sas7bdat
            - home.quartile_variables_sorted.sas7bdat
            - decyr.tin_appnum_appyear_decyear.csv
        output:
            - appnum_inventors_firmyear.csv

/*****************************************************************************/
/*********************   Common variable definitions   ***********************/
/*****************************************************************************/

    * Within firm wage quartiles
        The within-firm wage quartiles are constructed for firms with at least
        four employees, are defined as follows in the paper:
            Q1 : wages <= p25
            Q2 : p25 < wages <= p50
            Q3 : p50 < wages <= p75
            Q4 : p75 < wages

        We got those values of p25, p50, and p75 for each year with code that
        looked like the following:

            proc means data=wagedata noprint nway;
                class payer_tin_w2_max tax_yr;
                var wages;
                output out=temp(keep=payer_tin_w2_max tax_yr
                                     _FREQ_ p25_v p50_v p75_v)
                    p25=p25_v p50=p50_v p75=p75_v;
            run;

        We then merged that ``temp'' data set back onto the
        quartile_variables_sorted data for later use with the rest of the 
        data.


    * Male/Female indicators:
        The only thing to watch out for here is to not take male
        to be (1-female) or vice versa given that a worker's gender can be
        undefined. Work separately with gnd_ind=='M' and gnd_ind=='F' 
        observations when computing things based on gender.


    * Application cohort
        The application cohort is the set of workers for a firm who were
        employed by that firm in the application year. A consequence of that
        definition is that the same worker can belong to more than one firm's
        application cohort, for example if they worked for firm A in A's
        application year of 2005 and firm B in B's application year of 2008.

        When constructing variables related to the application year cohort,
        this is the sort of code that was written to prepare application year
        cohorts (note that `taad' is the tin_appnum_appyear_decyear.csv file
        imported into SAS format):

            proc sql;
                create table with_appyr as
                    select a.tin, a.tax_yr, a.payer_tin_w2_max, b.*,
                              case a.tax_yr
                                 when 1996 then wages1996
                                 when 1997 then wages1997
                                 when 1998 then wages1998
                                 when 1999 then wages1999
                                 when 2000 then wages2000
                                 when 2001 then wages2001
                                 when 2002 then wages2002
                                 when 2003 then wages2003
                                 when 2004 then wages2004
                                 when 2005 then wages2005
                                 when 2006 then wages2006
                                 when 2007 then wages2007
                                 when 2008 then wages2008
                                 when 2009 then wages2009
                                 when 2010 then wages2010
                                 when 2011 then wages2011
                                 when 2012 then wages2012
                                 when 2013 then wages2013
                                 when 2014 then wages2014
                              end as wages
                    from home.quartile_variables_sorted as a
                    left join /* IMPORTANT!!! */
                    taad as b
                    on a.payer_tin_w2_max = b.tin;

                create table appyr_cht as
                    select a.tin, a.tax_yr, a.wages,
                           b.payer_tin_w2_max, b.applicationyear,
                    from
                    with_appyr as a
                    inner join
                    (
                        select tin, applicationyear, payer_tin_w2_max
                        from with_appyr
                        where applicationyear=tax_yr
                    ) as b
                    on a.tin=b.tin;
            quit;

        Broken down, we first pre-process the quartile_variables_sorted data
        set to have the application years for each firm. As noted in the above
        code, it's important that we do a `left join' there so that we retain
        worker observations that might end up working at firms in given years
        not in the final analysis sample. This is necessary as a worker might
        be unemployed, or working for a different firm, but still belong in the
        application year cohort for any other firm that ends up in the final
        analysis sample.

        We then effectively in the second block that creates the `appyr_cht'
        table do a 1-to-many merge between the processed worker data from the
        first block, and a list of worker's and their application year cohort
        firms. This 1-to-many merge will duplicate each worker observation to
        allow for one worker being in several application year cohorts, and the
        process of selecting `b.payer_tin_w2_max' in that block will guarantee
        that the worker is associated with the correct firm's application year
        cohort  (instead of the firm where they might have been working in the
        given year, which would have been the case if we had selected
        `a.payer_tin_w2_max' for the table).


    * Stayers
        Firm stayers are employees in a given year who were present in the year
        of application. Note that the employee need not be continuously
        employed at the firm from application year to the given year (or vice
        versa), but merely employed at the same firm in both the current year
        and the application year. It's important to define stayers for both the
        pre- and post-decision year/application year period for the purposes of
        the difference-in-differences estimation strategy.

        This is the sort of code we have used to create firm stayers (after
        merging on the `taad' data set described above with the application
        year):

            %macro stayers;
            data final_sample;
                set qvs_with_taad;

                if applicationyear ne . then do;
                    stayer=.;
                    %do i = 2000 %to 2010;
                        if applicationyear=&i. and payer_tin_w2_max=payer_tin&i
                            then stayer=1;
                    %end;
                end;

                run;
            %mend;
            %stayers;


    * Separators
        Workers for a firm who left that firm the previous year. The tricky
        thing practically about this variable is that workers need to be 
        associated to their last year's firm, while keeping their current wages.

        For example, say that an employee works for firm A in 2006 and firm B
        in 2007. Then in 2007 they would be a firm A separator, and their 2007
        wages would contribute to firm A's separator wages. Another thing to make 
        sure of here is to count workers who separate into unemployment's zero 
        wages in any aggregates that are constructed.

        This is the sort of code that that was used to create firm separators:

            %macro separators;
                data separators;
                    set home.quartile_variables_sorted;

                        separator=.;
                        sepwages=.;
                        %do y = 1996 %to 2013;
                            %let ypl1 = %eval(&y. + 1);
                            if tax_yr = &y. then do;
                                if payer_tin_w2_max ne payer_tin&ypl1. then do;
                                    separator = 1;
                                    sepwages = coalesce(wages&ypl1.,0);
                                end;
                            end;
                        %end;

                        tax_yr = tax_yr+1;

                run;

            %mend;
            %separators;

    * Entrants
        Workers who were not working for the given firm last year. This
        variable is fairly straightforward, and the following is the sort of
        code that was used to create firm entrants:

            %macro entrants;
                data entrants;
                    set home.quartile_variables_sorted;

                        entrant=.;
                        %do y = 1997 %to 2014;
                            %let ym1 = %eval(&y. - 1);
                            if tax_yr = &y. then do;
                                if payer_tin_w2_max ne payer_tin&ym1.
                                    then entrant = 1;
                                end;
                            end;
                        %end;

                run;

            %mend;
            %entrants;


    * Recent entrants
        Workers who were not working for the given firm in any of the last
        three years. Someone is still counted as a recent entrant for a given
        firm even if they were working for the same firm with a gap period in
        between (e.g., works for firm A in 2002, 2003; then firm B in 2004;
        and then firm A again in 2005: this person is still a recent entrant).

        This is the sort of code that was used when constructing recent entrants:

            %macro recent_entrants;
                data recent_entrants;
                    set home.quartile_variables_sorted;

                        recent_entrants=.;
                        %do y = 1999 %to 2014;
                            %let ym1 = %eval(&y. - 1);
                            %let ym2 = %eval(&y. - 2);
                            %let ym3 = %eval(&y. - 3);
                            if tax_yr = &y. then do;
                                if (payer_tin_w2_max ne payer_tin&ym1.) or
                                   (payer_tin_w2_max ne payer_tin&ym2.) or
                                   (payer_tin_w2_max ne payer_tin&ym3.)

                                    then recent_entrants = 1;

                                end;
                            end;
                        %end;

                run;

            %mend;
            %recent_entrants;
