
I want to cast my data (a data.frame) from long to wide format, with the values of "ITEM" as columns and the values of "ITEM2" as cell values (see below):

Long format: (original screenshot not included; the structure corresponds to the example data lo_raw further below)

Wide format: (original screenshot not included; a sketch of the expected layout follows)
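
For illustration, casting the example data lo_raw shown further below with the dcast call from this question should give roughly this wide layout (one column per value of "ITEM", filled with the matching "ITEM2" value, or NA where an item does not occur in an event):

SEQUENCEID     EVENTID cakes coffee fruits juice limonade vegetable water
   1546842  5468503146 cakes coffee    NA    NA limonade        NA    NA
   1546842  5468503147    NA coffee    NA juice       NA        NA    NA
   1546842  5468503148 cakes     NA fruits    NA limonade vegetable water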

Therefore I use the dcast function from the reshape2 package:

df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")

Doing this, everything works fine. But with 7 million records in my data frame I ran into memory limits. So I decided to use ddply from the plyr package.

To make sure that every split has the same columns in the same order, I extract the values of "ITEM" in advance, append any missing item column filled with NA, and order the item columns alphabetically.

Below is the whole code:

library(reshape2)
library(plyr)

#Example data
lo_raw <- data.frame(SEQUENCEID = rep(1546842, 10),
               EVENTID = c(5468503146,5468503146,5468503146,5468503147,5468503147,
                           5468503148,5468503148,5468503148,5468503148,5468503148),
               ITEM  = c("cakes","limonade","coffee","coffee","juice","limonade","cakes","water","fruits","vegetable"),
               ITEM2 = c("cakes","limonade","coffee","coffee","juice","limonade","cakes","water","fruits","vegetable"),
               SPLIT = rep(1547000, 10))

#Extract items 
item <- as.character(unique(lo_raw$ITEM))

#Function to cast one split to wide format
castff <- function(df, item){

  df <- dcast(df, SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")

  #Append any item column that is missing in this split
  for(i in item){
    if (!(i %in% colnames(df))){
      df[, i] <- NA
    }
  }

  #Keep SEQUENCEID and EVENTID first, order the item columns alphabetically
  df <- df[, c(1, 2, order(colnames(df)[3:ncol(df)]) + 2)]
  df
}

#Apply the cast function per split
df_pivot <- ddply(lo_raw, .(SPLIT), .fun = castff, item = item,
                  .progress = "text", .inform = TRUE)

When executing ddply, RAM usage keeps increasing at runtime until it reaches its maximum (12 GB). Performance then becomes very slow, and I terminated R after a couple of hours.

Is there an alternative way to cast the whole dataset?

Thanks in advance.

silem
    You might try the `dcast` function in `data.table`. I've generally had good luck with memory limits when I use it. – lmo Jan 05 '17 at 16:18
  • Do not post your data as an image, please learn how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) – Jaap Jan 05 '17 at 16:26
  • added example data for reproducing – silem Jan 05 '17 at 16:53
  • @lmo thanks for your comment. You're right, dcast from data.table is more efficient. Nevertheless, I have to split the data (else: cannot allocate vector ...) and I run into the same problem, that RAM reaches its max. I think it's because R has to hold all splits in main memory to return the whole casted dataset. I've tried to use split from data.table and dcast all elements of the returned list in a loop. But it leads to the same issue. Is there maybe an alternative way to dcast the data? – silem Jan 08 '17 at 21:35
  • At this point, you may want to move to a machine with more RAM, or use some chunking method. One version of chunking would be to perform `dcast` on 1/n of the data, then write to disk. Delete the reshaped object, then repeat. Include as much as you can in each chunk (see the sketch below this comment thread). – lmo Jan 08 '17 at 21:41
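
Picking up the suggestions from the comments above, here is a minimal, untested sketch of the chunk-and-write-to-disk approach using dcast from data.table. It assumes the data is already loaded as lo_raw and reuses the SPLIT column to define the chunks; the file names and the way missing item columns are filled in are illustrative choices, not part of the original code.

library(data.table)

lo_raw <- as.data.table(lo_raw)
item   <- as.character(unique(lo_raw$ITEM))

for (s in unique(lo_raw$SPLIT)) {
  #Cast one chunk only, so just this piece has to fit in memory
  chunk <- dcast(lo_raw[SPLIT == s],
                 SEQUENCEID + EVENTID ~ ITEM, value.var = "ITEM2")

  #Add any item column missing in this chunk and fix the column order,
  #so all files share the same layout
  missing_cols <- setdiff(item, names(chunk))
  if (length(missing_cols) > 0) chunk[, (missing_cols) := NA]
  setcolorder(chunk, c("SEQUENCEID", "EVENTID", sort(item)))

  #Write the reshaped chunk to disk, then free memory before the next one
  fwrite(chunk, paste0("wide_chunk_", s, ".csv"))
  rm(chunk); gc()
}

The per-chunk files could afterwards be appended into one file (e.g. with fwrite(..., append = TRUE)) or read back with fread as needed, so the full wide table never has to be held in RAM at once.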

0 Answers