Solution
After carefully reading the comments posted by @Jon Spring, @Sweep Dodo and @Gregor Thomas, I realized the issue was the large number of duplicate keys in the chr column. I created a new key column in each table by pasting chr and start together as chr:start, so that there are far fewer duplicate keys. After this, the inner_join completes in a couple of seconds.
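A minimal sketch of that fix, assuming dplyr is installed. The toy tables below and the column name `key` are made up for illustration; they are not the real data, and the positions were chosen so that some keys match.

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical values).
table1 <- data.frame(
  chr   = rep("chr7", 4),
  start = c(100L, 101L, 102L, 103L),
  end   = c(101L, 102L, 103L, 104L)
)
table2 <- data.frame(
  chr       = rep("chr7", 4),
  starthg38 = c(101L, 102L, 103L, 104L),
  endhg38   = c(102L, 103L, 104L, 105L)
)

# Build a composite key (chr:start) so each key value is nearly unique,
# instead of joining on chr alone, where every row shares the same value.
table1$key <- paste(table1$chr, table1$start, sep = ":")
table2$key <- paste(table2$chr, table2$starthg38, sep = ":")

# Join on the composite key; each row now matches at most one row.
new_table <- inner_join(table1, table2, by = "key")
nrow(new_table)
```

The same effect can likely be had without the helper column by joining on both columns at once, for example `by = c("chr", "start" = "starthg38")`.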
Issue
Running 64-bit R 4.1.2 on Windows 11.
I have two tables.
Table 1:
chr | start | end |
---|---|---|
chr7 | 117120017 | 117120018 |
chr7 | 117120018 | 117120019 |
chr7 | 117120019 | 117120020 |
chr7 | 117120020 | 117120021 |
chr7 | 117120021 | 117120022 |
chr7 | 117120022 | 117120023 |
188700 rows x 3 columns
Table 2:
chr | starthg38 | endhg38 |
---|---|---|
chr7 | 117479963 | 117479964 |
chr7 | 117479964 | 117479965 |
chr7 | 117479965 | 117479966 |
chr7 | 117479966 | 117479967 |
chr7 | 117479967 | 117479968 |
chr7 | 117479968 | 117479969 |
188700 rows x 3 columns
I tried performing a simple inner join,
new_table <- inner_join(table1, table2, by = c("chr" = "chr"))
and got the following error:
Error: cannot allocate vector of size 132.7 Gb
Based on the solutions suggested in "R memory management / cannot allocate vector of size n Mb", I tried
gc()
and
memory.size(max = TRUE)
but neither of these solutions worked. More importantly, I'm trying to understand why R thinks allocating 132.7 Gb is necessary for such a small join operation.
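Following the duplicate-key explanation in the Solution, a figure of that size can be sanity-checked before running the join. An inner join returns, for each key value, the product of its counts in the two tables. The sketch below uses a hypothetical helper, `estimate_join_rows` (not part of dplyr), and assumes every row in both tables has chr equal to chr7, which the six rows shown suggest but do not prove.

```r
# Hypothetical helper (not part of dplyr): estimate how many rows an
# inner join on a single key column would return.
estimate_join_rows <- function(x_keys, y_keys) {
  x_counts <- table(x_keys)
  y_counts <- table(y_keys)
  shared <- intersect(names(x_counts), names(y_counts))
  # Each shared key contributes (count in x) * (count in y) output rows.
  sum(as.numeric(x_counts[shared]) * as.numeric(y_counts[shared]))
}

# Assumption: all 188700 rows in each table carry the same key, "chr7".
n_rows <- estimate_join_rows(rep("chr7", 188700), rep("chr7", 188700))
n_rows

# Approximate size in GiB of a single 4-byte integer column of that length.
gib_per_int_column <- n_rows * 4 / 1024^3
gib_per_int_column
```

Under that assumption the join would have to produce about 3.56e10 rows, and one integer column of that length comes to roughly 132.6 GiB, which is close to the 132.7 Gb in the error message.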