So, I have a 1.4 GB dataset `df`, and I am trying to reshape it using the following function:

library(dplyr)
library(tidyr)
library(readr)

reshaped <- function(df){
  df %>%
    select(subject_num, concept_code) %>% 
    group_by(subject_num, concept_code) %>%
    count() %>% 
    spread(concept_code, n, fill=0)

  return(df)
}

df <- read_rds('df.RDs') %>% 
  mutate(a = paste(a, b, sep = "|"))
df <- reshaped(df)
write_rds(df, 'df_reshaped.RDs')

I get `Error: cannot allocate vector of size 1205.6 GB`. While debugging, I discovered that the code gets stuck at the `spread()` statement inside the `reshaped` function. I don't see how a 1.4 GB dataset could ask for 1205.6 GB of memory in the dplyr code I wrote. Nothing in the code above appears to duplicate this dataset roughly 900 times either, so I am a bit stuck here. Could anyone suggest why this is happening?
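
For reference, the size of the dense wide table that `spread()` tries to build can be estimated from the number of distinct subjects and distinct concept codes. This is only a rough back-of-the-envelope sketch; it assumes nothing beyond the column names already used above, since the actual cardinalities aren't shown:

library(dplyr)

# spread() produces one row per subject_num and one numeric (double, 8-byte)
# column per distinct concept_code, so the dense result needs roughly
# n_subjects * n_codes * 8 bytes.
n_subjects <- n_distinct(df$subject_num)
n_codes    <- n_distinct(df$concept_code)

estimated_gb <- n_subjects * n_codes * 8 / 1024^3
estimated_gb

1205.6 GB corresponds to roughly 1.6e11 cells, which a long table of only a couple of GB can easily imply; for example, around 1.6 million subjects times 100,000 distinct concept codes would already get there.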

ibayramli
  • Hi & welcome. It will be hard to get any help from this. Try commenting out parts of the pipe inside your `reshaped` function (starting from the end) and see at which step it tries to allocate the vector. Then check here how to write a question for that specific issue: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Benjamin Schwetz Apr 15 '20 at 11:17
  • Yes, but at which point in there: the `spread()`, the `count()`, etc.? We cannot reproduce your problem with the code you provide, so you would need to figure that out. – Benjamin Schwetz Apr 15 '20 at 11:26
  • My bad, I added another update. It gets stuck at `spread()`. – ibayramli Apr 15 '20 at 12:00
  • You are `spread`ing a grouped df; does `ungroup()` before `spread()` help? Apart from that, you could give `pivot_wider()` (the successor of `spread()`) a try; a sketch of that follows this thread. – hplieninger Apr 15 '20 at 13:30
  • nope, didn't work ((( – ibayramli Apr 15 '20 at 18:33
  • The `return(df)` is misplaced, since it would return the original `df`. – hplieninger Apr 16 '20 at 09:48
  • Can you `dput(head(df[, c("subject_num", "concept_code")]))`? Furthermore, what is the result of `length(unique(df$))` for these columns? – hplieninger Apr 16 '20 at 09:50
  • Okay, so I divided the dataset into 100 chunks and processed one of them, which gave me an RDs file of 117 GB. So the entire dataset should be around 11.7 TB in a compressed format, which is insane. I still don't understand how something of size 1.4 GB could turn into an 11.7 TB monster. – ibayramli Apr 18 '20 at 09:18
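
A sketch of the `ungroup()` / `pivot_wider()` variant suggested in the thread, with the pipe result actually returned rather than the original `df`. It is untested against the real data and only reuses the column names from the question:

library(dplyr)
library(tidyr)
library(readr)

reshaped <- function(df){
  df %>%
    count(subject_num, concept_code) %>%   # one row per (subject, code) pair, count in column n
    ungroup() %>%                          # make sure no grouping survives before reshaping
    pivot_wider(names_from  = concept_code,
                values_from = n,
                values_fill = 0)           # the wide result is the function's return value
}

df <- read_rds('df.RDs') %>% 
  mutate(a = paste(a, b, sep = "|"))
df <- reshaped(df)
write_rds(df, 'df_reshaped.RDs')

Note that this only fixes the grouping and the misplaced `return(df)` pointed out above; if `concept_code` has a very large number of distinct values, the dense wide table will still be enormous no matter which reshaping function is used, as the chunking experiment in the last comment suggests.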

0 Answers