Split the dataset in R

Question

I have a dataset which contains Billno and Product columns in the following format:

Billno Product
1       123
1       176
2       189
3       1
3       2
3       44
3       46

etc

The output should be a table of the form:

Billno  Prod1  Prod2  Prod3  Prod4
1       123    176
2       189
3       1      2      44      46

Split function works but the dataset contains more than million records. Is there an efficient way of doing this?

Wietze314 · Accepted Answer · 2016-10-13T07:50:38.353

2

with dplyr:

library(dplyr)
library(tidyr)

bill <- rep(c(1,1,2,3,3,3,3),5)
prod <- rep(c(123,176,189, 1,2,44,46),5)

df <- data.frame(bill=bill, prod=prod)
#determine max product count (number of columns in result)
prodmax <- df %>% group_by(bill) %>% summarise(n = n())

df %>% group_by(bill) %>% 
  mutate(prodn = paste0("prod",row_number())) %>% 
  spread(prodn, prod) %>% 
#select columns in correct order
  select_(.dots = c('bill',paste0('prod',seq(1,max(prodmax$n)))))

results in:

      bill  prod1 prod2 prod3 prod4
(dbl) (dbl) (dbl) (dbl) (dbl)
1     1       123   176    NA    NA
2     2       189    NA    NA    NA
3     3         1     2    44    46

edited Oct 13 '16 at 07:50

answered Oct 12 '16 at 12:21

Wietze314

5,942
2
21
40

Hey @Wietze314 thanks a lot...This gives the result much faster than other solutions .. but the columns are ordered as prod1 prod10 prod11 in the output ... how do i change the order to prod1 prod2 prod3 ...... without manually specifying the column numbers – AB6 Oct 13 '16 at 05:48
I have changed the solution. It is one way to make sure that the columns (that are ordered alphabetically at default) are in the correct order. Another solution would be to make sure that the columns are alphabetically already correct (ie prod01, prod02, ... , prod11 etc.) – Wietze314 Oct 13 '16 at 07:52

score 0 · Answer 2 · answered Oct 12 '16 at 12:18

You can do

df <- read.table(header=T, text="Billno Product
1       123
1       176
2       189
3       1
3       2
3       44
3       46")
lst <- split(df[,-1], df[,1])
lst <- lapply(lst, "length<-", max(lengths(lst)))
df <- as.data.frame(do.call(rbind, lst))
#    V1  V2 V3 V4
# 1 123 176 NA NA
# 2 189  NA NA NA
# 3   1   2 44 46

and then

names(df) <- sub("V", "prod", names(df))
df$billno <- rownames(df)

score 0 · Answer 3 · answered Oct 12 '16 at 12:49

This will also do:

l <- lapply(split(df, df$Billno), function(x) t(x)[2,])
df <- as.data.frame(do.call(rbind, lapply(lapply(l, unlist), "[",
                      1:(max(unlist(lapply(l, length)))))))
names(df) <- paste('Prod', 1:ncol(df), sep='')
df
     Prod1 Prod2 Prod3 Prod4
1   123   176    NA    NA
2   189    NA    NA    NA
3     1     2    44    46

Split the dataset in R

3 Answers3