
Solution

After carefully reading the comments posted by @Jon Spring, @Sweepy Dodo and @Gregor Thomas, I realized the issue was the large number of duplicate keys in the chr column. I created a new key column in each table by pasting chr and start together (chr:start), so there are far fewer duplicate key values. After this the inner_join completes in a couple of seconds.
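
Roughly, the fix looks like this (a minimal sketch using dplyr and the column names from the tables shown below; starthg38 is my assumption for the matching column in the second table and may need to be swapped for whichever column actually corresponds to start in your data):

library(dplyr)

# Build a composite key in each table so that key values are (nearly) unique
table1 <- table1 %>% mutate(key = paste(chr, start, sep = ":"))
table2 <- table2 %>% mutate(key = paste(chr, starthg38, sep = ":"))

# Joining on the composite key avoids matching every chr7 row against
# every other chr7 row, so the result no longer blows up
new_table <- inner_join(table1, table2, by = "key")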

Issue

Running 64-bit R 4.1.2 on Windows 11.

I have two tables

Table 1:

chr start end
chr7 117120017 117120018
chr7 117120018 117120019
chr7 117120019 117120020
chr7 117120020 117120021
chr7 117120021 117120022
chr7 117120022 117120023

188700 rows x 3 columns

Table 2:

chr starthg38 endhg38
chr7 117479963 117479964
chr7 117479964 117479965
chr7 117479965 117479966
chr7 117479966 117479967
chr7 117479967 117479968
chr7 117479968 117479969

188700 rows x 3 columns

I try performing a simple inner join,

new_table <- inner_join(table1, table2, by = c("chr" = "chr"))

and get the following error

Error: cannot allocate vector of size 132.7 Gb

Based on the solutions suggested in "R memory management / cannot allocate vector of size n Mb", I tried

gc()

and

memory.size(max = TRUE)

but neither of these solutions worked. More importantly, I'm trying to understand why R thinks allocating 132.7 Gb is necessary for such a small join operation.

  • I'm guessing from the `start****` and `end****` columns that you are going to follow up with some sort of filtering operation, perhaps to find overlaps within matching `chr` values? If so, it will be more memory-efficient and faster to do a non-equi join using `data.table` or `sqldf` than to do a big join where every `chr7` value in each table is matched with all the same values in the other, effectively squaring the number of rows per `chr` value and potentially producing a very large table. In the extreme case, if every row were `chr7`, your Cartesian join would have 35 billion rows. – Jon Spring Feb 22 '22 at 16:53
  • A vector of 132.7 GB is indeed a large vector. In my experience, even an 8 GB vector cannot be created (on my laptop; it depends on RAM). As @JonSpring pointed out, your data have many repeating values in the join keys, e.g. `chr7`. Now think of an extreme example where both your dfs contain only `chr7` in the join key. This boils down to a cross join, where the number of resulting rows is `nrow(df1) * nrow(df2)`: in your case 188,700^2 = 35,607,690,000. Thus, not as small a join as it seemed. – Sweepy Dodo Feb 22 '22 at 17:02
  • Suggested duplicates: [Merge (join) data frames - too many rows in result](https://stackoverflow.com/q/35956108/903061) and [INNER JOIN in MySQL returns multiple entries of the same row](https://stackoverflow.com/q/50750381/903061) – Gregor Thomas Feb 22 '22 at 17:09
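
To make the cross-join arithmetic in the comments concrete, here is a tiny self-contained sketch with made-up data (hypothetical data frames a and b, not the actual tables): every repeated key value on one side matches every repeated key value on the other, so the row count multiplies.

library(dplyr)

# Toy tables sharing a single key value, repeated on both sides
a <- data.frame(chr = rep("chr7", 4), start = 1:4)
b <- data.frame(chr = rep("chr7", 3), starthg38 = 101:103)

joined <- inner_join(a, b, by = "chr")
nrow(joined)
#> [1] 12    # 4 * 3: every row of `a` pairs with every row of `b`

# In the extreme case the comments describe (every row chr7), the result
# would be 188,700^2 = 35,607,690,000 rows, hence the enormous allocation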
