0

There are 2 huge datasets which I need to merge (left_join). The datasets have 15mio rows each.

I tried different ways to merge them, but there were always error messages:

For example, dataset X has variables NAME and value_Q1, dataset Y has variables NAME and value_Q2.

Try1 <- left_join(X, Y)
  • Error: can not allocate vector of size 19251.5 Gb
Try2 <- merge(X, Y, by = "NAME", all.x =TRUE)
  • Error: negative length vectors are not allowed
Try3_1 <- merge(X[1:10000000,], Y, by = "NAME", all.x =TRUE)

- worked

Try3_2 <- merge(X[10000001:nrow(X),], Y, by = "NAME", all.x =TRUE)
  • Error: negative length vectors are not allowed

The result of Try3_2 surprised me the most, since Try3_1 worked with 10mio rows, but Try3_2 didn't with only 5mio rows. How does merge in R actually work?

In the end, I also tried the code above with data tables, the error message says that results are in more than 2^31 rows.

user438383
  • 5,716
  • 8
  • 28
  • 43
  • Going from 15mio rows to 2^31 sounds like you are getting cartesian-like joins, those are extremely difficult with `base::merge`, `dplyr::*_join`, or `data.table`-merges. How many rows resulted from `Try3_1`? Are you omitting keys on which to join? – r2evans Sep 16 '21 at 19:32
  • 1
    BTW, you have `X` (upper-case) yet you are doing `:nrow(x)` (lower-case). Perhaps that is causing problems? – r2evans Sep 16 '21 at 19:33
  • Very related, likely this is a duplicate: https://stackoverflow.com/q/42479854/3358272. Check `sum(duplicated(X[,c("NAME")])` (and `Y`) to see if/how-many duplicate rows you may have. – r2evans Sep 16 '21 at 19:53
  • In Try3_1 there are 10mio rows in the results. I only need the values from Y which matches with X, so a left join took place. – Yang Wu Sep 17 '21 at 07:32
  • 2. nrow(x) was a typing mistake in my question, in the program I have it correct. 3. There is no duplicates neither in X nor in Y. – Yang Wu Sep 17 '21 at 07:38
  • I fixed your mistake in the question code for you, it is really not good to have known typos in your code, new people hoping to answer you are likely to think the same thing I am: if *you* have not tested the code you share with us here, why should we? Good luck. – r2evans Sep 17 '21 at 12:43

1 Answers1

0

Please ensure that there is no NA rows in your data. I just found out that there were empty rows in the data. After deleting theses rows, the code worked!