Hello I have a data set called "Sample" like this
Sample
A tibble: 221,088 x 7
gvkey two_digit_sic fyear part1 part2 part3 part4
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001003 57 1987 0.0317 0.0686 0.0380 0.157
2 001003 57 1988 -0.358 0.0623 -0.338 0.162
3 001003 57 1989 -0.155 0.0614 -0.784 0.140
4 001004 50 1988 0.0868 0.00351 0.108 0.300
5 001004 50 1989 0.0176 0.00281 0.113 0.296
6 001004 50 1990 -0.0569 0.00257 0.0618 0.291
7 001004 50 1991 0.00317 0.00263 -0.112 0.314
8 001004 50 1992 -0.0418 0.00253 -0.0479 0.300
9 001004 50 1993 0.00763 0.00274 0.0216 0.334
10 001004 50 1994 -0.0115 0.00239 0.0459 0.307
# ... with 221,078 more rows
count(Sample, gvkey)
# A tibble: 23,978 x 2
gvkey n
<chr> <int>
1 001003 3
2 001004 30
3 001009 7
4 001010 16
5 001011 7
6 001012 2
7 001013 23
8 001014 5
9 001017 8
10 001019 14
# ... with 23,968 more rows
count(Sample, two_digit_sic)
# A tibble: 73 x 2
two_digit_sic n
<chr> <int>
1 01 527
2 02 111
3 07 105
4 08 120
5 09 24
6 10 8860
7 12 477
8 13 11200
9 14 811
10 15 858
# ... with 63 more rows
Then I run the following model
library(dplyr)
library(broom)
mjones_1991 <- Sample %>%
group_by(two_digit_sic, fyear) %>%
filter(n()>=10) %>%
do (augment (lm (part1 ~ part2 + part3 + part4, data = .))) %>%
ungroup()
Then I got the following results
mjones_1991
# A tibble: 219,587 x 13
two_digit_sic fyear part1 part2 part3 part4 .fitted .se.fit .resid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01 1988 -0.0478 2.36e-2 0.147 1.01 -0.119 0.0576 0.0714
2 01 1988 -0.174 4.29e-2 0.327 0.810 0.00104 0.0560 -0.175
3 01 1988 0.0250 6.15e-4 0.422 0.619 0.0534 0.0711 -0.0284
4 01 1988 -0.0974 2.55e-2 -0.0134 0.292 -0.0847 0.0586 -0.0127
5 01 1988 -0.142 1.15e-3 0.0233 0.677 -0.137 0.0489 -0.0058
6 01 1988 -0.479 2.46e-1 -0.0552 0.538 -0.0393 0.0635 -0.439
7 01 1988 0.00861 2.78e-1 0.251 1.58 -0.0407 0.122 0.0493
8 01 1988 -0.154 2.94e-2 -0.348 0.619 -0.284 0.0984 0.131
9 01 1988 -0.0526 8.96e-4 0.172 0.602 -0.0580 0.0452 0.0053
10 01 1988 -0.0574 2.15e-2 0.0535 0.316 -0.0596 0.0540 0.0021
# ... with 219,577 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
# .cooksd <dbl>, .std.resid <dbl>
The problem is that I lost gvkey; therefore, I cannot identify the .fitted or .se.fit or .resid is for which gvkey.
Here is the filtering of the two_digit_sic == "01" and fyear == "1988"
# A tibble: 18 x 7
gvkey two_digit_sic fyear part1 part2 part3 part4
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001266 01 1988 -0.0478 0.0236 0.147 1.01
2 002249 01 1988 -0.174 0.0429 0.327 0.810
3 002812 01 1988 0.0250 0.000615 0.422 0.619
4 003702 01 1988 -0.0974 0.0255 -0.0134 0.292
5 008596 01 1988 -0.142 0.00115 0.0233 0.677
6 009062 01 1988 -0.479 0.246 -0.0552 0.538
7 009391 01 1988 0.00861 0.278 0.251 1.58
8 010390 01 1988 -0.154 0.0294 -0.348 0.619
9 010884 01 1988 -0.0526 0.000896 0.172 0.602
10 012349 01 1988 -0.0574 0.0215 0.0535 0.316
11 012750 01 1988 0.0577 0.0157 0.0794 0.422
12 013155 01 1988 0.117 0.124 0.370 0.829
13 013462 01 1988 0.255 0.0828 0.529 0.270
14 013468 01 1988 -0.0774 0.0445 0.129 0.191
15 013550 01 1988 -0.0219 0.0204 0.0375 0.879
16 013743 01 1988 -0.0911 0.228 0.0870 0.739
17 014400 01 1988 0.415 0.546 0.0710 0.0437
18 014881 01 1988 -0.134 0.00380 0.0211 0.666
You can see I have 18 observations for two_digit_sic == "01" and fyear == "1988"
in ~~~mjones_1991~~~ data set I have the same observations, but I lose the identifiers (gvkey). Do you have any idea how can keep the gvkey?