4

I have Compustat data in the long format, which I transform to the wide format by spread (from the tidyr package).

Then I do some calculations and after that I want to bring the dataframe back to the long format again. Is there any sort of "memory" function, so that my new long dataframe is in the exact same way as the old one (same order).

The Problem is that there are a lot of NAs and the data of each stock begins when the stock is first listed and ends after it's delisting or at the end of the sample. My sample goes from 1960 to 2015 (quarterly). Of course not all stocks have data for all that dates, but when I go back from the wide to the long format each stock gets all of the dates from 1960.1 to 2015.4. This long format dataframe is part of a score I am building and I have to compare it to other long format dataframes (all of them have the same order by kypermno and date), because of that I need to convert the wide dataframe back into the exact original form just with the new values.

Edit: Here is an example of my problem:

long format 'original' (called 'test'):

    `kypermno fyyyyq ROE_Q
      <int>  <int> <dbl>
1      1001   1985  0.56
2      1001   1986  0.43
3      1001   1987  0.78
4      1001   1988    NA
5      1001   1989  0.34
6      1001   1990  0.76
7      1002   1980  0.12
8      1002   1981  0.67
9      1002   1982  0.12
10     1002   1983  0.56
11     1002   1984    NA
12     1002   1985  0.91
13     1002   1986  0.45
14     1002   1987  0.23
15     1002   1988  0.54
16     1002   1989  0.14
17     1002   1990  0.19
18     1002   1991  0.27`

With the following code I put it in the wide format:

dat_wide <- spread(test, kypermno, ROE_Q)

now in the wide format it looks like this:

 fyyyyq `1001` `1002`
*   <int>  <dbl>  <dbl>
1    1980     NA   0.12
2    1981     NA   0.67
3    1982     NA   0.12
4    1983     NA   0.56
5    1984     NA     NA
6    1985   0.56   0.91
7    1986   0.43   0.45
8    1987   0.78   0.23
9    1988     NA   0.54
10   1989   0.34   0.14
11   1990   0.76   0.19
12   1991     NA   0.27

And when I put it back in the long format it becomes this:

dat_long <- gather(dat_wide, key = 'fyyyyq', value = 'ROE_Q', -kypermno)

fyyyyq kypermno ROE_Q
    <int>    <chr> <dbl>
1    1980     1001    NA
2    1981     1001    NA
3    1982     1001    NA
4    1983     1001    NA
5    1984     1001    NA
6    1985     1001  0.56
7    1986     1001  0.43
8    1987     1001  0.78
9    1988     1001    NA
10   1989     1001  0.34
11   1990     1001  0.76
12   1991     1001    NA
13   1980     1002  0.12
14   1981     1002  0.67
15   1982     1002  0.12
16   1983     1002  0.56
17   1984     1002    NA
18   1985     1002  0.91
19   1986     1002  0.45
20   1987     1002  0.23
21   1988     1002  0.54
22   1989     1002  0.14
23   1990     1002  0.19
24   1991     1002  0.27

As you can see there are a lot more NAs now (since they are created going form long to wide) and NA omit isn't an option, because with that all NAs are omitted (not only the new created one). So when I go back again from the wide format to the long format, I want to get the old (18 rows long) long format dataframe and not the one I got (with 24 rows and "new" NAs).

I hope my problem got comprehensible now.

PS: As you can see I didn't manage to get kypermno in the first column and fyyyyq in the second (after going back to long format), but I think it doesn't affect the issue above.

kayB
  • 151
  • 2
  • 9
  • 2
    Please provide a reproducible problem / example, or give some indication of what the problematic data set looks like and what you would expect it to look like. Read through this for some inspiration on reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Drj May 07 '17 at 00:55

1 Answers1

1

There is a fill= option that allows you to choose the value tobe used to "fill" the gaps. Sadly it replaces also the original NA so it is useless.

This is an inelegant solution using the original test data to eliminate the cases that were not originally present.

NB : I had to modify slightly your code to make it work.

test <- read.table(text = 
'ID kypermno fyyyyq ROE_Q
1      1001   1985  0.56
2      1001   1986  0.43
3      1001   1987  0.78
4      1001   1988    NA
5      1001   1989  0.34
6      1001   1990  0.76
7      1002   1980  0.12
8      1002   1981  0.67
9      1002   1982  0.12
10     1002   1983  0.56
11     1002   1984    NA
12     1002   1985  0.91
13     1002   1986  0.45
14     1002   1987  0.23
15     1002   1988  0.54
16     1002   1989  0.14
17     1002   1990  0.19
18     1002   1991  0.27',
header = TRUE)
test <- test[,-1]

library(tidyr)
dat_wide <- spread(test, kypermno, ROE_Q)
dat_wide

dat_long <- gather(dat_wide, key = 'kypermno', value = ROE_Q, -fyyyyq)
dat_long


# Keep only the original data
dat_long[ paste(dat_long[,2], dat_long[,1]) %in% paste(test[,1], test[,2]),]

# Alternative (shorter and probably better)
merge(test[,1:2], dat_long, all.x=TRUE)

But maybe you should ask yourself whether it is really necessary to transform your data in wide format...

Gilles San Martin
  • 4,224
  • 1
  • 18
  • 31
  • I handled the problem by replacing the original NAs by a letter and when I went back from wide to long I could omit the NAs and replace the letter again with NAs in the "new" long format. I had to go for the wide format because I had to calculate the 60 quarters ROE standard deviation for each stock over time (with at least 12 non-missing observations) and I didn't saw a convenient way to do that in the long format. – kayB May 08 '17 at 07:31