I have a data frame of non-numeric values in the following format:

DF1:

  col1    col2
1  a       b
2  a       c
3  z       y
4  z       x
5  a       d
6  m       n

I need to convert it into this format:

DF2:

  col1    col2    col3    col4
1  a        b       c       d
2  z        y       x       NA
3  m        n       NA      NA

Here col1 acts as the primary key (not sure if that is the right terminology in R), and the remaining columns contain the elements associated with that key (as seen in DF1).

DF2 will have more columns than DF1, with the exact number depending on the maximum number of elements associated with any single key.

Because each key has a different number of associated elements, some cells will have no value; these are represented as NA (as shown in DF2).

The column names could be anything.

I have tried reshape(), melt() + cast(), and even a generic for loop where I cbind the elements and then delete the processed rows.
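
For reference, a base-R reshape() needs a per-group index to use as its timevar; a minimal sketch on the sample data above (not my actual code, just the shape of the approach):

DF1 <- data.frame(col1 = c("a", "a", "z", "z", "a", "m"),
                  col2 = c("b", "c", "y", "x", "d", "n"))
# number the elements within each key; reshape() uses this as the "time" variable
DF1$idx <- ave(seq_along(DF1$col1), DF1$col1, FUN = seq_along)
reshape(DF1, idvar = "col1", timevar = "idx", direction = "wide")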

It is part of a very big dataset with over 50 million rows. I might have to use cloud services for this task, but that is a different discussion.

I am new to R, so there might be some obvious solution that I am missing.

Any help would be much appreciated.

-Thanks

UT-

2 Answers

If this is a big dataset, we can use data.table:

library(data.table)

# example data from the question
DF1 <- data.frame(col1 = c("a", "a", "z", "z", "a", "m"),
                  col2 = c("b", "c", "y", "x", "d", "n"))

# number the elements within each key to build the target column names,
# then reshape to wide format; missing cells are filled with NA
setDT(DF1)[, i1 := paste0("col", seq_len(.N) + 1L), by = col1]
dcast(DF1, col1 ~ i1, value.var = "col2")
#    col1 col2 col3 col4
#1:    a    b    c    d
#2:    m    n   NA   NA
#3:    z    y    x   NA
akrun
  • Thank you @akrun! The solution works! As I am currently testing my code on a small dataset (~1k rows), I was wondering why this solution is preferable to the one given by @rawr for a bigger dataset (~50m rows). Should I run my whole dataset against both solutions, or is there a clear winner here? Again, sorry, I'm new to R – UT- Mar 05 '16 at 20:31
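
One way to decide is simply to time both answers on a sample of the real data. A minimal sketch using the microbenchmark package (a separate CRAN package, not part of either answer; the wrapper function names are hypothetical):

library(data.table)
library(dplyr)
library(tidyr)
library(microbenchmark)

DF <- data.frame(col1 = c("a", "a", "z", "z", "a", "m"),
                 col2 = c("b", "c", "y", "x", "d", "n"))

# hypothetical wrappers around the two answers so they can be timed side by side
via_datatable <- function(df) {
  dt <- as.data.table(df)  # copies, so the input is left untouched
  dt[, i1 := paste0("col", seq_len(.N) + 1L), by = col1]
  dcast(dt, col1 ~ i1, value.var = "col2")
}
via_dplyr <- function(df) {
  df %>%
    group_by(col1) %>%
    mutate(colname = paste0("col", seq_len(n()) + 1)) %>%
    spread(colname, col2)
}

microbenchmark(via_datatable(DF), via_dplyr(DF), times = 10L)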

Using dplyr and tidyr:

library(tidyr)
library(dplyr)

DF <- data_frame(col1 = c("a", "a", "z", "z", "a", "m"),
                 col2 = c("b", "c", "y", "x", "d", "n"))
# you need another column to act as the key for spreading
DF %>%
  group_by(col1) %>%
  mutate(colname = paste0("col", seq_len(n()) + 1)) %>%
  spread(colname, col2)
#> Source: local data frame [3 x 4]
#> Groups: col1 [3]
#> 
#>    col1  col2  col3  col4
#>   (chr) (chr) (chr) (chr)
#> 1     a     b     c     d
#> 2     m     n    NA    NA
#> 3     z     y     x    NA
cderv
  • Thank you! The solution works effectively! However, [this link](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) states that data.table is comparatively faster than dplyr on very large datasets, which is what I need to deal with. So I will try the solution provided by @akrun when I run it on the big dataset. Still, thanks a lot for the solution! – UT- Mar 05 '16 at 20:50
  • Yes, data.table is known to perform well on very big datasets; one reason is that it works by reference. However, dplyr is also performant on big datasets, even though it does not work by reference. Just so you know. – cderv Mar 05 '16 at 21:25
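
To illustrate the by-reference point, a minimal sketch (not from either answer) using data.table's address() helper, which reports where an object lives in memory:

library(data.table)
DT <- data.table(x = 1:3)
address(DT)       # memory address of DT before the update
DT[, y := x * 2]  # `:=` adds the column by reference; no copy of DT is made
address(DT)       # same address afterwards, confirming the in-place update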