
I am trying to do a conditional cross join in data.table, and I am running into this error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
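
For reference, the `by=.EACHI` route the message suggests only avoids the huge allocation if `j` can aggregate per group instead of materialising the full cross product. A minimal sketch, assuming the `dat` and `dat2` from the example below with join key `k`; the `sum(a)` aggregation is purely illustrative:

# by = .EACHI evaluates j once per row of dat2, so the full
# cross product is never allocated; only useful if j aggregates.
dat[, k := t + 1]
dat2[, k := t_prime]
agg <- dat[dat2, .(sum_a = sum(a)), by = .EACHI, on = .(k), nomatch = 0L]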

After some sleuthing, I still cannot find a solution. I have more than 1 TB of RAM, so memory is not the issue. Below is a reproducible example; as you scale up N, the code eventually throws this error.

library(data.table)

N <- 10000
J <- 50
dat  <- CJ(t = 1:N, a = 1:N, j = 1:5)      # CJ() already returns a data.table: N * N * 5 rows
dat2 <- CJ(j_prime = 1:J, t_prime = 1:N)   # J * N rows
datfinal <- dat[, k := t + 1][dat2[, k := t_prime], on = .(k),
                              nomatch = 0L, allow.cartesian = TRUE][, k := NULL]
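
The error is about output size, not RAM: each key value k = t + 1 matches N * 5 rows in `dat` and J rows in `dat2`, so the inner join wants about (N - 1) * N * 5 * J rows, roughly 2.5e10 at N = 10000 and J = 50, well past the 2^31 single-result limit. A quick pre-join check (a sketch; `cnt_x` and `cnt_i` are illustrative names):

# Rows per key on each side; the sum of products is exactly the
# number of rows the join would try to materialise.
cnt_x <- dat[, .(n_x = .N), by = .(k = t + 1)]
cnt_i <- dat2[, .(n_i = .N), by = .(k = t_prime)]
expected <- cnt_x[cnt_i, on = .(k), nomatch = 0L][, sum(as.numeric(n_x) * n_i)]
expected          # ~2.5e10 with N = 10000, J = 50
expected > 2^31   # TRUE, hence the vecseq error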
    
    
– wolfsatthedoor
  • Some interesting discussion on this [here](https://stackoverflow.com/questions/11610562/merging-data-tables-uses-more-than-10-gb-ram) and [here](https://stackoverflow.com/questions/18102042/join-results-in-more-than-231-rows-internal-vecseq-reached-physical-limit). – Adam Quek Jun 21 '22 at 05:54
  • @AdamQuek Neither seems to resolve it. – wolfsatthedoor Jun 21 '22 at 13:02
  • The error says `more than 2^31 rows`, which agrees with the expectation of lots of data. While R supports indexing beyond `2^31` in some contexts, I don't know that `data.frame`s have gotten there yet. Unfortunately, until R supports frames with more than 2^31 rows, you may need to look at subsetting the problem and/or choosing another storage/processing mechanism. – r2evans Jun 21 '22 at 16:13
  • Any workaround solutions are welcome then, @r2evans – wolfsatthedoor Jun 21 '22 at 19:24
  • Sorry, I already suggested what I could think of given what I know of your data. Anything beyond subsetting (splitting the data into smaller chunks) or another storage mechanism (e.g., SQL) will likely require a solution specific to the context of the data and the remaining processing you have to do on it. (A chunked-join sketch follows this comment thread.) – r2evans Jun 21 '22 at 19:48
  • @r2evans Started a bounty here: https://stackoverflow.com/questions/61091313/are-data-tables-with-more-than-231-rows-supported-in-r-with-the-data-table-pack – wolfsatthedoor Jun 21 '22 at 20:47
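
Following r2evans's suggestion, one workaround is to run the join in slices of the key so that no single result crosses 2^31 rows, handling each slice as it is produced. A minimal sketch, assuming the `dat`/`dat2` from the question; `chunk_size` and `process_chunk()` are hypothetical placeholders for whatever aggregation or on-disk append the real pipeline needs:

dat[, k := t + 1]
dat2[, k := t_prime]
chunk_size <- 100L                      # ~2.5e8 output rows per slice here
for (ks in split(1:N, ceiling((1:N) / chunk_size))) {
  part <- dat[k %in% ks][dat2[k %in% ks], on = .(k),
                         nomatch = 0L, allow.cartesian = TRUE]
  process_chunk(part)                   # hypothetical: aggregate or write out
  rm(part)
}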

0 Answers