3

I have a CRSP list of stock price as follow

    PERMNO  date        TICKER  RETX
1   10138   2007-01-03  TROW    0.045236
2   10138   2007-01-04  TROW    0.008743
3   10138   2007-01-05  TROW    -0.001950
4   10138   2007-01-08  TROW    0.018237
5   10138   2007-01-09  TROW    0.004051
6   10138   2007-01-10  TROW    0.005734
7   10138   2007-01-11  TROW    0.019637
8   10138   2007-01-12  TROW    0.005591
...
1   10145   2007-01-03  HON -0.003095
2   10145   2007-01-04  HON -0.000443
3   10145   2007-01-05  HON -0.009539
4   10145   2007-01-08  HON 0.006047
5   10145   2007-01-09  HON 0.007124
6   10145   2007-01-10  HON -0.006189
7   10145   2007-01-11  HON 0.016681
8   10145   2007-01-12  HON -0.003282
9   10145   2007-01-16  HON 0.001317
10  10145   2007-01-17  HON -0.001754
11  10145   2007-01-18  HON -0.010979
...

Once I use tidyr::spread(x,TICKER,RETX), it return a matrix with most of value to be NA. Is there is any other function to rearrange the matrix, listing each stock price in one column? or how to achieve it by a few line?

update: I figure out it is the PERMNO column causing the problem. After I get rid of the PERMNO column another problem appear:

> spread(A1[,2:4],TICKER,RETX)
Error: Duplicate identifiers for rows (129717, 143815), (129718, 143816), ...

So, I just randomly select two row mentioned in the message

       PERMNO       date TICKER     RETX
129717  75104 2007-01-03    CBS 0.012172
> A1[143815,]
       PERMNO       date TICKER    RETX
143815  76226 2007-01-03    CBS 0.01347

Turn out the data set is quite dirty and it contains duplicated series. A better solution would be using PERMNO as key. Here is what I get

    date        10225       10516       10909       ...
1   2007-01-03  0.005738    0.003129    -0.006593   ...
2   2007-01-04  -0.011062   -0.005615   0.028761    ...
3   2007-01-05  0.000824    -0.001568   -0.022366   ...
4   2007-01-08  -0.005059   0.005027    -0.003520   ...
5   2007-01-09  0.002956    -0.024383   0.000883    ...
6   2007-01-10  -0.003301   -0.008651   -0.010587   ...
...

It's frustrating, but I finally get something. Is there anyway to replace the numerical column name with the matching TICKER. Here is a demo

    PERMNO  date        FO          HON        ...
1   10225   2007-01-03  0.005738    -0.003095  ...
2   10225   2007-01-04  -0.011062   -0.000443  ...
3   10225   2007-01-05  0.000824    -0.009539  ...
4   10225   2007-01-08  -0.005059   0.006047   ...
5   10225   2007-01-09  0.002956    0.007124   ...
6   10225   2007-01-10  -0.003301   -0.006189  ...
7   10225   2007-01-11  0.007925    0.016681   ...
8   10225   2007-01-12  -0.010914   -0.003282  ...
HLD25
  • 39
  • 1
  • 2
  • 4
  • 1
    I assume you mean `tidyr::spread` (I edited your question accordingly). You example is poor because it only shows one value of `RETX`, so your problem isn't illustrated---your code runs just fine on these 8 rows with no `NA`s returned. (And it returns a data frame, not a matrix.) `tidyr` isn't built-in, but you can see `reshape2::dcast` for a more flexible version of `spread`. – Gregor Thomas Feb 12 '15 at 00:53
  • However, if you're getting `NA`s, I assume you have different `permno` values or different dates for your stocks--probably permno. So if that column isn't applicable to your spread data, just drop it before spreading. – Gregor Thomas Feb 12 '15 at 00:56
  • Thanks for the advice. It was my first time posting question on this site, I am still working on it. – HLD25 Feb 12 '15 at 00:58
  • No problem! Just edit your question with a bit of data that shows the issue. You can see more [reproducibility tips here](http://stackoverflow.com/q/5963269/903061). But do edit new data into your question, otherwise we can't see the problem or test a solution. Try to find 8 or 10 rows, with two TICKER values, 4 or 5 dates each, that show the problem. And maybe explain what PERMNO is. – Gregor Thomas Feb 12 '15 at 01:01
  • 1
    There's your problem then. If you need more help, edit the code into your question instead of putting lots of code in comments. If that solves your problem you might just want to delete the question. – Gregor Thomas Feb 12 '15 at 01:13

1 Answers1

4

If you have duplicated data in places, you'll first need to get rid of those values because otherwise, if you use tidyr::spread, it will replace the value with the length. Anyway, assuming that you've removed duplicates using unique or something similar, here's how I would do it with tidyr, since that's what you asked and since tidyr is pretty nifty and concise:

 A1 <- spread(A1[, c("date", "TICKER", "RETX")], TICKER, RETX)

If you include PERMNO, you will get NAs for every line where that particular value of TICKER doesn't have a matching value in PERMNO.

shirewoman2
  • 1,842
  • 4
  • 19
  • 31