I am working with fairly large data frames; the largest has about 300,000 rows and 1,500 variables. Because of that, I sometimes get this error when working with them:

Error: cannot allocate vector of size x.x Gb

Usually this means I have to split my code into smaller steps or change my approach altogether.

At the moment I am doing several selections and left_join calls that look something like this:

library(dplyr)
# Subset the main data frame to the key and the variables of interest
df2 <- select(df1, matchcode, x1, x2, x3)
# Join variables from a third data frame on the shared key
df2 <- df2 %>% left_join(select(df3, matchcode, y1, y2, y3), by = "matchcode")

The selection works fine. The odd thing, however, is that I am now getting these errors from left_join, and the amount that cannot be allocated is very small:

Error: cannot allocate vector of size 2.6 Mb
Error: cannot allocate vector of size 4.0 Mb
Error: cannot allocate vector of size 2.6 Mb

Is there some other issue I am not aware of that could cause these errors, or is there a fault in my code?

Alexis
Tom

1 Answer

Since posting this question I have done some research. At first I thought the errors were related to the number (or size) of objects in my workspace, but that was not the case.

The most important answer to my own question (please feel free to elaborate on this) is that the size of the vector that cannot be allocated says little about how much memory the operation needs overall. The message reports only the single allocation that failed, not the memory that was already in use at that point.
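To see how much memory is already taken before a join starts, it can help to list the objects in the workspace by size. The following is a minimal base R sketch; `object_sizes_mb` and the demo objects are hypothetical names made up for illustration and are not part of the original post.

```r
# Hypothetical helper: sizes (in MB) of all objects in an environment, largest first.
object_sizes_mb <- function(env = globalenv()) {
  nms <- ls(envir = env)
  sizes <- vapply(
    nms,
    function(nm) as.numeric(object.size(get(nm, envir = env))),
    numeric(1)
  )
  sort(sizes, decreasing = TRUE) / 1024^2
}

# Demo on a throwaway environment so the example does not depend on your workspace.
demo_env <- new.env()
assign("big", numeric(1e6), envir = demo_env)  # one million doubles
assign("small", 1:10, envir = demo_env)
sizes_mb <- object_sizes_mb(demo_env)
print(round(sizes_mb, 2))

# gc() triggers a garbage collection and reports the memory R currently uses.
print(gc())
```

Calling `object_sizes_mb()` without arguments inspects the global environment. If the workspace is already close to the available memory, even a request of a few Mb can fail.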

It turned out that one of the errors came from a many-to-many join on two huge datasets, which produced the error:

Error: cannot allocate vector of size 140.4 Mb
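A many-to-many join can be spotted before running it by counting how often each key occurs on both sides. The sketch below uses base R only and assumes the key column contains no missing values; `estimate_left_join_rows` and the toy key vectors are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: number of rows a left join of x onto y would produce,
# computed from the key columns alone. Assumes there are no NA keys.
estimate_left_join_rows <- function(x_keys, y_keys) {
  x_counts <- table(x_keys)
  y_counts <- table(y_keys)
  shared <- intersect(names(x_counts), names(y_counts))
  # Each shared key contributes (occurrences in x) * (occurrences in y) rows.
  matched <- sum(as.numeric(x_counts[shared]) * as.numeric(y_counts[shared]))
  # Keys found only in x are kept once per occurrence.
  only_x <- setdiff(names(x_counts), shared)
  unmatched <- sum(as.numeric(x_counts[only_x]))
  matched + unmatched
}

# Toy keys: "a" is duplicated on both sides, so the join is many-to-many.
x_keys <- c("a", "a", "b", "c")
y_keys <- c("a", "a", "a", "b")

many_to_many <- anyDuplicated(x_keys) > 0 && anyDuplicated(y_keys) > 0
expected_rows <- estimate_left_join_rows(x_keys, y_keys)
print(many_to_many)
print(expected_rows)
```

With the real data, the key vectors would be something like `df2$matchcode` and `df3$matchcode`. If the estimate is far larger than the number of rows in either input, the join is multiplying rows and is likely to exhaust memory.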

The other joins were one-to-many (and they reported much smaller sizes in the error, see the original post). I was able to join these data frames by using a data.table solution instead:

library(data.table)
# allow.cartesian only exists in data.table's merge method, so convert first if these are still plain data frames
setDT(df1); setDT(df2)
df1 <- merge(df1, df2, by = "matchcode", all.x = TRUE, allow.cartesian = TRUE)

For the many-to-many join, I collapsed one of the datasets so the join became a one-to-many. I hope this helps.
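As an illustration of the collapsing step, here is a minimal base R sketch. It assumes that averaging the numeric columns per key is an acceptable way to collapse; the right aggregation depends on the data, and the toy data frames below are made up for the example.

```r
# Toy data: the key "a" appears twice in df3, so df3 is not unique per matchcode.
df1 <- data.frame(matchcode = c("a", "a", "b", "c"), x1 = 1:4)
df3 <- data.frame(matchcode = c("a", "a", "b"), y1 = c(1, 3, 5), y2 = c(10, 20, 30))

# Collapse df3 to one row per matchcode (assumption: the mean is a sensible summary).
collapsed <- aggregate(
  df3[c("y1", "y2")],
  by = list(matchcode = df3$matchcode),
  FUN = mean
)

# The key is now unique on the right-hand side, so the left join is one-to-many
# and returns exactly one row per row of df1.
joined <- merge(df1, collapsed, by = "matchcode", all.x = TRUE)
print(joined)
```

Because `collapsed` has a unique key, the joined result can never have more rows than `df1`, which keeps the memory needed for the join predictable.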

Tom