
I'm trying to spread a very long data frame (17,000,000 rows; 111.2 MB RDS file) into a wide format by a variable with ~2,000 unique values. Running this on a 16-core, 64 GB RAM Linux machine results in `Error: cannot allocate vector of size 3132.3GB`.

The dplyr code below works perfectly on smaller datasets (~1/3 the size).

library(dplyr); library(tidyr); library(tibble)    # pipes/select(), spread(), rowid_to_column()

data <- data %>%
  rowid_to_column() %>%                            # unique row id so each observation keeps its own row
  spread(key = parameter_name, value = value) %>%  # one column per unique parameter_name
  select(-rowid)

Any idea how to get this done? More efficient code?

ceefel
  • Could you provide some data to run your code? Have you tried an equivalent solution with {data.table} maybe? – F. Privé Apr 09 '19 at 18:46
  • Seems relevant: [What methods can we use to reshape VERY large data sets?](https://stackoverflow.com/questions/55077668/what-methods-can-we-use-to-reshape-very-large-data-sets) – markus Apr 09 '19 at 18:55
  • As suggested in one of the comments in @markus' link (my comment, to be fair): if your data is rather large, it might be necessary to `split`, reshape each element individually, then recombine somehow (a rough sketch of that idea follows after these comments). Care must be taken to ensure columns and rows are properly aligned when recombining, but that's minimal effort in the big scheme of things. – r2evans Apr 09 '19 at 19:34
  • You might also try the `widyr` package: https://github.com/dgrtwo/widyr – Jon Spring Apr 09 '19 at 20:02
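
A rough sketch of the split / reshape / recombine idea from the comment above (this is not the commenter's actual code; the chunk size and the helper column `chunk` are made up for illustration). Each block of rows is spread on its own, and bind_rows() re-aligns the pieces by column name:

library(dplyr)
library(tidyr)
library(tibble)
library(purrr)

data <- data %>% rowid_to_column()

# split the long data into blocks of ~1e6 rows (hypothetical chunk size)
pieces <- data %>% group_split(chunk = (rowid - 1) %/% 1e6)

# reshape each block separately, then stack; bind_rows() matches columns by
# name and fills parameters that are missing from a block with NA
wide <- pieces %>%
  map(~ .x %>% select(-chunk) %>% spread(key = parameter_name, value = value)) %>%
  bind_rows() %>%
  select(-rowid)

Note that the combined result is still the full dense wide table, so this mainly avoids one huge intermediate allocation (or lets you write each reshaped piece to disk instead of keeping everything in memory).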

2 Answers


It seems to me that you are trying to create a 17e6 x 2000 data.frame, which is quite unreasonable, as each row would contain only one non-missing value.

If your values are integers, the resulting data.frame will be of size:

v <- sample.int(17e6)
format(object.size(v) * 2000, units = 'Gb')
# [1] "126.7 Gb"

or, if they are doubles:

v <- rnorm(17e6)
format(object.size(v) * 2000, units = 'Gb')
# [1] "253.3 Gb"

You should maybe look into sparse matrices, or rethink what you are doing.
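
For example, a minimal sketch with simulated data (assuming integer row ids, ~2,000 parameters and numeric values; a sparse matrix from the Matrix package stores only the cells that actually hold a value):

library(Matrix)

N <- 17e6                                  # number of observations
n <- 2000                                  # number of distinct parameters
m <- sparseMatrix(
  i = 1:N,                                 # row index: one row per observation
  j = sample.int(n, N, replace = TRUE),    # column index: which parameter the value belongs to
  x = rnorm(N)                             # the values; only these N cells are stored
)
format(object.size(m), units = 'Gb')
# roughly 0.2 Gb, instead of ~250 Gb for the dense double version above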

Test case for 200k x 2000 with data.table:

N <- 200000
n <- 2000
require(data.table)
dt <- data.table(parameter_name = sample.int(n, N, replace = TRUE), value = rnorm(N), id = 1:N)
r <- dcast(dt, id ~ parameter_name)
format(object.size(r), units = 'Gb')
# [1] "3 Gb"

That is already 3 GB for just 200k rows; scaling up to 17e6 rows (85 times more) puts the full result at roughly 250 GB, well beyond 64 GB of RAM.

minem

data.table transformations are more memory-efficient and faster, although dplyr is more convenient.
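
As a rough sketch, an equivalent of the dplyr/tidyr code from the question could look like this (the column names `parameter_name` and `value` are taken from the question; this is not tested on the actual data):

library(data.table)

setDT(data)                                   # convert the data.frame by reference, no copy
data[, rowid := .I]                           # same role as rowid_to_column()
wide <- dcast(data, rowid ~ parameter_name, value.var = "value")
wide[, rowid := NULL]                         # drop the helper id column

Whether this fits in memory still depends on the size of the dense result estimated in the other answer; dcast() avoids some intermediate copies, but the final wide table is just as large.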