Creating an Identification variable to identify observations

Question

Hello I have a data set called "Sample" like this

Sample 
A tibble: 221,088 x 7
   gvkey  two_digit_sic fyear    part1   part2   part3 part4
   <chr>  <chr>         <dbl>    <dbl>   <dbl>   <dbl> <dbl>
 1 001003 57             1987  0.0317  0.0686   0.0380 0.157
 2 001003 57             1988 -0.358   0.0623  -0.338  0.162
 3 001003 57             1989 -0.155   0.0614  -0.784  0.140
 4 001004 50             1988  0.0868  0.00351  0.108  0.300
 5 001004 50             1989  0.0176  0.00281  0.113  0.296
 6 001004 50             1990 -0.0569  0.00257  0.0618 0.291
 7 001004 50             1991  0.00317 0.00263 -0.112  0.314
 8 001004 50             1992 -0.0418  0.00253 -0.0479 0.300
 9 001004 50             1993  0.00763 0.00274  0.0216 0.334
10 001004 50             1994 -0.0115  0.00239  0.0459 0.307
# ... with 221,078 more rows

count(Sample, gvkey)

# A tibble: 23,978 x 2
   gvkey      n
   <chr>  <int>
 1 001003     3
 2 001004    30
 3 001009     7
 4 001010    16
 5 001011     7
 6 001012     2
 7 001013    23
 8 001014     5
 9 001017     8
10 001019    14
# ... with 23,968 more rows

count(Sample, two_digit_sic)

# A tibble: 73 x 2
   two_digit_sic     n
   <chr>         <int>
 1 01              527
 2 02              111
 3 07              105
 4 08              120
 5 09               24
 6 10             8860
 7 12              477
 8 13            11200
 9 14              811
10 15              858
# ... with 63 more rows

Then I run the following model

library(dplyr)
library(broom)
mjones_1991 <- Sample %>% 
  group_by(two_digit_sic, fyear) %>% 
  filter(n()>=10) %>% 
  do (augment (lm (part1 ~ part2 + part3 + part4, data = .))) %>% 
  ungroup()

Then I got the following results

mjones_1991

# A tibble: 219,587 x 13
   two_digit_sic fyear    part1   part2   part3 part4  .fitted .se.fit  .resid
   <chr>         <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
 1 01             1988 -0.0478  2.36e-2  0.147  1.01  -0.119    0.0576  0.0714 
 2 01             1988 -0.174   4.29e-2  0.327  0.810  0.00104  0.0560 -0.175  
 3 01             1988  0.0250  6.15e-4  0.422  0.619  0.0534   0.0711 -0.0284 
 4 01             1988 -0.0974  2.55e-2 -0.0134 0.292 -0.0847   0.0586 -0.0127 
 5 01             1988 -0.142   1.15e-3  0.0233 0.677 -0.137    0.0489 -0.0058
 6 01             1988 -0.479   2.46e-1 -0.0552 0.538 -0.0393   0.0635 -0.439  
 7 01             1988  0.00861 2.78e-1  0.251  1.58  -0.0407   0.122   0.0493 
 8 01             1988 -0.154   2.94e-2 -0.348  0.619 -0.284    0.0984  0.131  
 9 01             1988 -0.0526  8.96e-4  0.172  0.602 -0.0580   0.0452  0.0053
10 01             1988 -0.0574  2.15e-2  0.0535 0.316 -0.0596   0.0540  0.0021
# ... with 219,577 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
#   .cooksd <dbl>, .std.resid <dbl>

The problem is that I lost gvkey; therefore, I cannot identify the .fitted or .se.fit or .resid is for which gvkey.

Here is the filtering of the two_digit_sic == "01" and fyear == "1988"

# A tibble: 18 x 7
   gvkey  two_digit_sic fyear    part1    part2   part3  part4
   <chr>  <chr>         <dbl>    <dbl>    <dbl>   <dbl>  <dbl>
 1 001266 01             1988 -0.0478  0.0236    0.147  1.01  
 2 002249 01             1988 -0.174   0.0429    0.327  0.810 
 3 002812 01             1988  0.0250  0.000615  0.422  0.619 
 4 003702 01             1988 -0.0974  0.0255   -0.0134 0.292 
 5 008596 01             1988 -0.142   0.00115   0.0233 0.677 
 6 009062 01             1988 -0.479   0.246    -0.0552 0.538 
 7 009391 01             1988  0.00861 0.278     0.251  1.58  
 8 010390 01             1988 -0.154   0.0294   -0.348  0.619 
 9 010884 01             1988 -0.0526  0.000896  0.172  0.602 
10 012349 01             1988 -0.0574  0.0215    0.0535 0.316 
11 012750 01             1988  0.0577  0.0157    0.0794 0.422 
12 013155 01             1988  0.117   0.124     0.370  0.829 
13 013462 01             1988  0.255   0.0828    0.529  0.270 
14 013468 01             1988 -0.0774  0.0445    0.129  0.191 
15 013550 01             1988 -0.0219  0.0204    0.0375 0.879 
16 013743 01             1988 -0.0911  0.228     0.0870 0.739 
17 014400 01             1988  0.415   0.546     0.0710 0.0437
18 014881 01             1988 -0.134   0.00380   0.0211 0.666

You can see I have 18 observations for two_digit_sic == "01" and fyear == "1988"

in ~~~mjones_1991~~~ data set I have the same observations, but I lose the identifiers (gvkey). Do you have any idea how can keep the gvkey?

Or `left_join(mjones_1991, Sample %>% select(gvkey, two_digit_sic))` to add them back afterwards. — Jon Spring, Jun 25 '19 at 17:51
@akrun, I cannot do it because the paper I used to build the model did not do that. They create the model by two_digit_sic and fyear. — Sharif, Jun 25 '19 at 17:51
Okay, then do a left_join after that with the unique gvkey, two_digit_sic from the data — akrun, Jun 25 '19 at 17:52
I think you can do something like `do ( data.frame(gvkey = .$gvkey, augment (lm (part1 ~ part2 + part3 + part4, data = .))))`. But you should check that that lines things up correctly! :-D — aosmith, Jun 25 '19 at 17:58
You can also do something like `lm(part1 ~ part2 + part3 + part4 + 0*gvkey)`. — Curt F., Jul 19 '19 at 00:40
Possible duplicate of [Creating models and augmenting data without losing additional columns in dplyr/broom](https://stackoverflow.com/questions/48626980/creating-models-and-augmenting-data-without-losing-additional-columns-in-dplyr-b) — Curt F., Jul 19 '19 at 20:00

Creating an Identification variable to identify observations

0 Answers0