2

I have the following table:

perid   date    rating
10001   2005    RD
10001   2006    GN
10001   2007    GD
10002   2008    GD
10002   2009    YW
10002   2010    GN
10002   2011    GN
10003   2005    GD
10003   2006    GN
10003   2007    YW

how can I turn this table to the following format:

perid    2005    2006    2007    2008    2009    2010    2011
10001    RD      GN      GN      N/A     N/A     N/A     N/A
10002    N/A     N/A     N/A     GD      YW      GN      GN
10003    GD      GN      YW      N/A     N/A     N/A     N/A

Or if I can do this in R?

Thanks, P

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
PMa
  • 1,751
  • 7
  • 22
  • 28

2 Answers2

3

In base R, the function to use would be reshape, and you would be converting your data from "long" to "wide".

reshape(mydf, direction = "wide", idvar="perid", timevar="date")
#   perid rating.2005 rating.2006 rating.2007 rating.2008 rating.2009 rating.2010 rating.2011
# 1 10001          RD          GN          GD        <NA>        <NA>        <NA>        <NA>
# 4 10002        <NA>        <NA>        <NA>          GD          YW          GN          GN
# 8 10003          GD          GN          YW        <NA>        <NA>        <NA>        <NA>

Alternatively, you can look at dcast from the "reshape2" package and try:

library(reshape2)
dcast(mydf, perid ~ date, value.var="rating")
#   perid 2005 2006 2007 2008 2009 2010 2011
# 1 10001   RD   GN   GD <NA> <NA> <NA> <NA>
# 2 10002 <NA> <NA> <NA>   GD   YW   GN   GN
# 3 10003   GD   GN   YW <NA> <NA> <NA> <NA>

For better speed, convert your data.frame to a data.table and use dcast.data.table instead.

library(reshape2)
library(data.table)
DT <- data.table(mydf)
dcast.data.table(DT, perid ~ date, value.var = "rating")
#    perid 2005 2006 2007 2008 2009 2010 2011
# 1: 10001   RD   GN   GD   NA   NA   NA   NA
# 2: 10002   NA   NA   NA   GD   YW   GN   GN
# 3: 10003   GD   GN   YW   NA   NA   NA   NA

From your comments, it sounds like you have duplicated values among the combinations of column 1 and 2, which means that by default, dcast will use length as its aggregation function.

To get past this, you need to make a secondary ID (or "time", actually) column, which can be done like this.

First, some sample data. Note the duplicated combination of the first two columns in rows 1 and 2.

mydf <- data.frame(
  period = c(10001, 10001, 10002, 10002, 10003, 10003, 10001, 10001),
  date = c(2005, 2005, 2006, 2007, 2005, 2006, 2006, 2007),
  rating = c("RD", "GN", "GD", "GD", "YW", "GN", "GD", "YN"))
mydf
#   period date rating
# 1  10001 2005     RD
# 2  10001 2005     GN
# 3  10002 2006     GD
# 4  10002 2007     GD
# 5  10003 2005     YW
# 6  10003 2006     GN
# 7  10001 2006     GD
# 8  10001 2007     YN

When you try dcast, it just "counts" the number under each combination.

## Not what you want
dcast(mydf, period ~ date, value.var="rating")
# Aggregation function missing: defaulting to length
#   period 2005 2006 2007
# 1  10001    2    1    1
# 2  10002    0    1    1
# 3  10003    1    1    0

Either decide which duplicated row should be dropped, or, if all the data belongs in your dataset, add a "time" variable, like this:

mydf$time <- ave(1:nrow(mydf), mydf$period, mydf$date, FUN = seq_along)
mydf
#   period date rating time
# 1  10001 2005     RD    1
# 2  10001 2005     GN    2
# 3  10002 2006     GD    1
# 4  10002 2007     GD    1
# 5  10003 2005     YW    1
# 6  10003 2006     GN    1
# 7  10001 2006     GD    1
# 8  10001 2007     YN    1

Now, dcast should work fine. Here's a semi-long version...

dcast(mydf, period + time ~ date, value.var="rating")
#   period time 2005 2006 2007
# 1  10001    1   RD   GD   YN
# 2  10001    2   GN <NA> <NA>
# 3  10002    1 <NA>   GD   GD
# 4  10003    1   YW   GN <NA>

... and a semi-wide version.

dcast(mydf, period ~ date + time, value.var="rating")
#   period 2005_1 2005_2 2006_1 2007_1
# 1  10001     RD     GN     GD     YN
# 2  10002   <NA>   <NA>     GD     GD
# 3  10003     YW   <NA>     GN   <NA>
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • I am getting an error `Error in [.data.frame`(data, , idvar) : undefined columns selecte`` :( – PMa Apr 11 '14 at 04:36
  • +1 - I thought you had to `melt` before `dcast`, but apparently not. – James King Apr 11 '14 at 04:37
  • @user3114046, if the data are already in a long form (which is what `melt` does), then there is no need to `melt` again before using `dcast`. – A5C1D2H2I1M1N2O1R2T1 Apr 11 '14 at 04:39
  • @ Ananda - `reshape` worked, you are right, I misspelled the persid, but with `dcast`, the result I got is `1` and `0` rather than `GN`, `RD`. – PMa Apr 11 '14 at 04:44
  • @AnandaMahto, also with `reshape`, the outcome mismatched by 1 column, i.e. col `rating.2005` has all the persid, col `rating.2006` has the data for `rating.1005`... not sure what I did wrong :( – PMa Apr 11 '14 at 05:01
  • @PerriMa, it seems to work fine with your sample data. If you can share some sample data that reproduces your problem (preferably using `dput` or by generating sample data (see here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), you are more likely to get answers that would be more likely to work with your given scenario. – A5C1D2H2I1M1N2O1R2T1 Apr 11 '14 at 05:38
  • thanks, I figured out the issue with `reshape`, it is just `dcast` is not giving me text but 1 and 0s. – PMa Apr 11 '14 at 06:11
  • @Perri Ma, Ananda has solved your problem for you but you have not accepted his answer. – James King Apr 11 '14 at 16:17
  • @user3114046 sorry for that, I just accepted the Ananda's solution, but just for the `reshape` not `dcast`. – PMa Apr 11 '14 at 16:24
  • @AnandaMahto, the `reshape` worked beautifully. THANK YOU!!! Re `dcast`, is there an easy way for me to email you the code and results? I know you sent me a link but I have not had a chance to dig into it. Essentially, the output is the same as using `reshape` but instead of text, it is giving me `1`s and `0`s. It seems like it is counting the data... – PMa Apr 11 '14 at 16:37
  • @AnandaMahto - I got `TRUE` – PMa Apr 11 '14 at 16:46
  • when you say `present ID`, you mean the `perid` column? – PMa Apr 11 '14 at 16:52
  • @AnandaMahto, sorry 1 more question, the `reshape` function is not effected by the duplicates? (since I got the text results not 1s and 0s? – PMa Apr 11 '14 at 22:21
  • @PerriMa, you would have probably gotten some warnings because of the duplicates, where only the first value would have been used. The number of warnings would depend on the number of observed duplicates. – A5C1D2H2I1M1N2O1R2T1 Apr 15 '14 at 10:41
0

Simple way of doing this is using the reshape2 package -

period <- c(10001,10001,10001,10002,10002,10002,10002,10003,10003,10003)  
date  <- c(2005, 2006,2007,2008, 2009,2010,2011,2005,2006,2007)  
rating <- c("RD","GN","GD","GD","YW","GN", "GN","GD", "GN","YW")
a <- data.frame(period,date,rating)

library(reshape2)
b <- dcast(a,formula=period~date,value.var="rating")

b
>b
   period 2005 2006 2007 2008 2009 2010 2011
1  10001   RD   GN   GD <NA> <NA> <NA> <NA>
2  10002 <NA> <NA> <NA>   GD   YW   GN   GN
3  10003   GD   GN   YW <NA> <NA> <NA> <NA> 
RHelp
  • 815
  • 2
  • 8
  • 23