I have Compustat data in the long format, which I transform to the wide format by spread
(from the tidyr
package).
Then I do some calculations and after that I want to bring the dataframe back to the long format again. Is there any sort of "memory" function, so that my new long dataframe is in the exact same way as the old one (same order).
The Problem is that there are a lot of NA
s and the data of each stock begins when the stock is first listed and ends after it's delisting or at the end of the sample. My sample goes from 1960 to 2015 (quarterly). Of course not all stocks have data for all that dates, but when I go back from the wide to the long format each stock gets all of the dates from 1960.1 to 2015.4. This long format dataframe is part of a score I am building and I have to compare it to other long format dataframes (all of them have the same order by kypermno and date), because of that I need to convert the wide dataframe back into the exact original form just with the new values.
Edit: Here is an example of my problem:
long format 'original' (called 'test'):
`kypermno fyyyyq ROE_Q
<int> <int> <dbl>
1 1001 1985 0.56
2 1001 1986 0.43
3 1001 1987 0.78
4 1001 1988 NA
5 1001 1989 0.34
6 1001 1990 0.76
7 1002 1980 0.12
8 1002 1981 0.67
9 1002 1982 0.12
10 1002 1983 0.56
11 1002 1984 NA
12 1002 1985 0.91
13 1002 1986 0.45
14 1002 1987 0.23
15 1002 1988 0.54
16 1002 1989 0.14
17 1002 1990 0.19
18 1002 1991 0.27`
With the following code I put it in the wide format:
dat_wide <- spread(test, kypermno, ROE_Q)
now in the wide format it looks like this:
fyyyyq `1001` `1002`
* <int> <dbl> <dbl>
1 1980 NA 0.12
2 1981 NA 0.67
3 1982 NA 0.12
4 1983 NA 0.56
5 1984 NA NA
6 1985 0.56 0.91
7 1986 0.43 0.45
8 1987 0.78 0.23
9 1988 NA 0.54
10 1989 0.34 0.14
11 1990 0.76 0.19
12 1991 NA 0.27
And when I put it back in the long format it becomes this:
dat_long <- gather(dat_wide, key = 'fyyyyq', value = 'ROE_Q', -kypermno)
fyyyyq kypermno ROE_Q
<int> <chr> <dbl>
1 1980 1001 NA
2 1981 1001 NA
3 1982 1001 NA
4 1983 1001 NA
5 1984 1001 NA
6 1985 1001 0.56
7 1986 1001 0.43
8 1987 1001 0.78
9 1988 1001 NA
10 1989 1001 0.34
11 1990 1001 0.76
12 1991 1001 NA
13 1980 1002 0.12
14 1981 1002 0.67
15 1982 1002 0.12
16 1983 1002 0.56
17 1984 1002 NA
18 1985 1002 0.91
19 1986 1002 0.45
20 1987 1002 0.23
21 1988 1002 0.54
22 1989 1002 0.14
23 1990 1002 0.19
24 1991 1002 0.27
As you can see there are a lot more NAs now (since they are created going form long to wide) and NA omit isn't an option, because with that all NAs are omitted (not only the new created one). So when I go back again from the wide format to the long format, I want to get the old (18 rows long) long format dataframe and not the one I got (with 24 rows and "new" NAs).
I hope my problem got comprehensible now.
PS: As you can see I didn't manage to get kypermno in the first column and fyyyyq in the second (after going back to long format), but I think it doesn't affect the issue above.