
I need help converting my long data of dimension 1558810 x 84 to wide data of dimension 1558810 x 4784.

Let me explain in detail how and why. My raw data is as follows, with three main columns:

id  empId   dept
1   a       social
2   a       Hist
3   a       math
4   b       comp
5   a       social
6   b       comp
7   c       math
8   c       Hist
9   b       math
10  a       comp

id is the unique key that tells which employee went to which department of the university on a given day. I need this to be transformed as below:

id  empId   dept    social  Hist    math    comp
1   a       social  1       0       0       0
2   a       Hist    0       1       0       0
3   a       math    0       0       1       0
4   b       comp    0       0       0       1
5   a       social  1       0       0       0
6   b       comp    0       0       0       1
7   c       math    0       0       1       0
8   c       Hist    0       1       0       0
9   b       math    0       0       1       0
10  a       comp    0       0       0       1

I have two datasets, one with 49k rows and one with 1.55 million rows. For the smaller dataset, which has 1100 unique department values, I used dcast from the reshape2 package to get the desired dataset (so the transformed data has 3 + 1100 columns and 49k rows). But when I use the same function on my larger dataset, which has 4700 unique department values, R crashes because of a memory issue. I tried various other alternatives like xtabs, reshape, etc., but every time it failed with a memory error.

I have now resorted to a crude for loop for this purpose:

columns <- unique(ds$dept)

# add one 0/1 indicator column per unique department
for (i in seq_along(columns)) {
  ds[, columns[i]] <- ifelse(ds$dept == columns[i], 1, 0)
}

But this is extremely slow, and the code has been running for 10 hours now. Is there any workaround for this that I am missing?

Any suggestions would be of great help!

RHelp
  • Have you tried any of the methods I suggested? The third method, `dcast.data.table`, should be faster, though I didn't do the benchmarks. – akrun Nov 14 '14 at 07:53
  • The `spread` function of `tidyr` should work nicely with big data. – bdecaf Nov 14 '14 at 08:13

2 Answers


You could try

df$dept <- factor(df$dept, levels=unique(df$dept))
res <- cbind(df,  model.matrix(~ 0+dept, df))
colnames(res) <- gsub("dept(?=[A-Za-z])", "", colnames(res), perl=TRUE)
res
#   id empId   dept social Hist math comp
#1   1     a social      1    0    0    0
#2   2     a   Hist      0    1    0    0
#3   3     a   math      0    0    1    0
#4   4     b   comp      0    0    0    1
#5   5     a social      1    0    0    0
#6   6     b   comp      0    0    0    1
#7   7     c   math      0    0    1    0
#8   8     c   Hist      0    1    0    0
#9   9     b   math      0    0    1    0
#10 10     a   comp      0    0    0    1
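
If memory is the bottleneck, a sparse variant of the same idea may be worth trying (a sketch, assuming the Matrix package is available; a dense 1.55M x 4700 indicator matrix of doubles would take tens of gigabytes, while the sparse form only stores the single non-zero entry per row):

library(Matrix)
df$dept <- factor(df$dept, levels=unique(df$dept))
# same 0/1 indicators as model.matrix(), but stored as a sparse dgCMatrix
m <- sparse.model.matrix(~ 0 + dept, df)
colnames(m) <- sub("^dept", "", colnames(m))
# keep the result sparse; cbind-ing it into a dense data.frame would
# reintroduce the original memory problem
m[1:3, ]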

Or you could try

cbind(df, as.data.frame.matrix(table(df[,c(1,3)])))

Or using data.table

library(data.table)
setDT(df)
dcast.data.table(df, id + empId + dept ~ dept, fun=length) 
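On the full 1.55-million-row data, the same approach might look like this (a sketch; the file name and the use of fread to read the data are assumptions):

library(data.table)
# fread() reads large delimited files faster and with less overhead than read.csv
ds <- fread("big_data.csv")   # assumed columns: id, empId, dept
# one row per id, with a 0/1 count column for each unique dept value
res <- dcast.data.table(ds, id + empId + dept ~ dept, fun.aggregate = length)
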

Or using qdap

library(qdap)
cbind(df, as.wfm(with(df, mtabulate(setNames(dept, id)))))

data

df <- structure(list(id = 1:10, empId = c("a", "a", "a", "b", "a", 
"b", "c", "c", "b", "a"), dept = c("social", "Hist", "math", 
"comp", "social", "comp", "math", "Hist", "math", "comp")), .Names = c("id", 
"empId", "dept"), class = "data.frame", row.names = c(NA, -10L))
akrun
  • I think `cbind` can be avoided: `dcast.data.table(df, id + empId + dept ~ dept, fun=length)` – Arun Nov 14 '14 at 09:01

Try:

> cbind(dd[1:3], dcast(dd, dd$id~dd$dept, length)[-1])
Using dept as value column: use value.var to override.
   id empId   dept comp Hist math social
1   1     a social    0    0    0      1
2   2     a   Hist    0    1    0      0
3   3     a   math    0    0    1      0
4   4     b   comp    1    0    0      0
5   5     a social    0    0    0      1
6   6     b   comp    1    0    0      0
7   7     c   math    0    0    1      0
8   8     c   Hist    0    1    0      0
9   9     b   math    0    0    1      0
10 10     a   comp    1    0    0      0
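
The `Using dept as value column` message can be silenced by naming the value column explicitly; a sketch on the same toy data, with bare column names in the formula:

library(reshape2)
# value.var = "dept" tells dcast which column the length() aggregate should count
cbind(dd[1:3], dcast(dd, id ~ dept, length, value.var = "dept")[-1])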

data:

> dput(dd)
structure(list(id = 1:10, empId = structure(c(1L, 1L, 1L, 2L, 
1L, 2L, 3L, 3L, 2L, 1L), .Label = c("a", "b", "c"), class = "factor"), 
    dept = structure(c(4L, 2L, 3L, 1L, 4L, 1L, 3L, 2L, 3L, 1L
    ), .Label = c("comp", "Hist", "math", "social"), class = "factor")), .Names = c("id", 
"empId", "dept"), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10"))
rnso
  • I get an error when I try this - `Error: n must be a positive integer In addition: Warning message: In split_indices(.group, .n) : NAs introduced by coercion` – RHelp Nov 14 '14 at 07:44
  • Another version added above in the answer. – rnso Nov 14 '14 at 08:23
  • @rnso In the description, the OP mentions `I used dcast in the reshape2 package to get the desired dataset(thus, transformed data would have 3+1100 columns and 49k rows). But when I use the same function on my larger dataset that has 4700 unique department values, my R crashes because of Memory issue` – akrun Nov 14 '14 at 08:42
  • Deleted that part. But how can the error with the other method be sorted out? It does not appear to be due to a memory problem, and R is not crashing, probably because only two vectors are used in the dcast formula. – rnso Nov 14 '14 at 12:17
  • @rnso Regarding the error, I am not sure. Perhaps, the OP could give us more info about that. – akrun Nov 14 '14 at 14:16