
Solution

After carefully reading the comments posted by @Jon Spring, @Sweepy Dodo and @Gregor Thomas, I realized the issue was the large number of duplicate keys in the chr column. I created a new key column in each table by pasting chr and start together (chr:start), so there are far fewer duplicate key values. After this the inner_join completes in a couple of seconds.
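
Roughly, the fix looks like this (a minimal sketch using dplyr and the column names from the tables shown below; starthg38 is my assumption for the matching column in the second table and may need to be swapped for whichever column actually corresponds to start in your data):

library(dplyr)

# Build a composite key in each table so that key values are (nearly) unique
table1 <- table1 %>% mutate(key = paste(chr, start, sep = ":"))
table2 <- table2 %>% mutate(key = paste(chr, starthg38, sep = ":"))

# Joining on the composite key avoids matching every chr7 row against
# every other chr7 row, so the result no longer blows up
new_table <- inner_join(table1, table2, by = "key")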

Issue

Running 64-bit R 4.1.2 on Windows 11.

I have two tables

Table 1:

chr start end
chr7 117120017 117120018
chr7 117120018 117120019
chr7 117120019 117120020
chr7 117120020 117120021
chr7 117120021 117120022
chr7 117120022 117120023

188700 rows x 3 columns

Table 2:

chr starthg38 endhg38
chr7 117479963 117479964
chr7 117479964 117479965
chr7 117479965 117479966
chr7 117479966 117479967
chr7 117479967 117479968
chr7 117479968 117479969

188700 rows x 3 columns

I try performing a simple inner join,

new_table <- inner_join(table1, table2, by = c("chr" = "chr"))

and get the following error

Error: cannot allocate vector of size 132.7 Gb

Based on the solutions suggested in "R memory management / cannot allocate vector of size n Mb", I tried

gc()

and

memory.size(max = TRUE)

but neither of these solutions worked. More importantly, I'm trying to understand why R thinks allocating 132.7 Gb is necessary for such a small join operation.

  • I'm guessing from the `start****` and `end****` columns that you are going to follow up with some sort of filtering operation, perhaps to find overlaps within matching `chr` values? If so, it will be more memory-efficient and faster to do a non-equi join using `data.table` or `sqldf` than to do a big join where every `chr7` value in each table is matched with all the same values in the other, effectively squaring the number of rows per `chr` value and potentially producing a very large table. In the extreme case, if every row were `chr7`, your Cartesian join would have 35 billion rows. – Jon Spring Feb 22 '22 at 16:53
  • A vector of 132.7 GB is indeed a large vector. In my experience, even an 8 GB vector cannot be created (on my laptop; it depends on RAM). As @JonSpring pointed out, your data have many repeating values in the join keys, e.g. `chr7`. Now think of an extreme example where both your dfs contain only `chr7` in the join key. This boils down to a cross join, where the number of resulting rows is `nrow(df1) * nrow(df2)`: in your case 188,700^2 = 35,607,690,000. Thus, not as small a join as it seemed. – Sweepy Dodo Feb 22 '22 at 17:02
  • Suggested duplicates: [Merge (join) data frames - too many rows in result](https://stackoverflow.com/q/35956108/903061) and [INNER JOIN in MySQL returns multiple entries of the same row](https://stackoverflow.com/q/50750381/903061) – Gregor Thomas Feb 22 '22 at 17:09
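
To make the cross-join arithmetic in the comments concrete, here is a tiny self-contained sketch with made-up data (hypothetical data frames a and b, not the actual tables): every repeated key value on one side matches every repeated key value on the other, so the row count multiplies.

library(dplyr)

# Toy tables sharing a single key value, repeated on both sides
a <- data.frame(chr = rep("chr7", 4), start = 1:4)
b <- data.frame(chr = rep("chr7", 3), starthg38 = 101:103)

joined <- inner_join(a, b, by = "chr")
nrow(joined)
#> [1] 12    # 4 * 3: every row of `a` pairs with every row of `b`

# In the extreme case the comments describe (every row chr7), the result
# would be 188,700^2 = 35,607,690,000 rows, hence the enormous allocation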
