
I am trying to do a cross join (from the original question here), and I have 500 GB of RAM. The problem is that the final data.table has more than 2^31 rows, so I get this error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Is there a way to override this? When I add by=.EACHI, I get the error:

  'by' or 'keyby' is supplied but not j

I know this question is not in an ideal reproducible format (my apologies!), but I am not sure that is strictly necessary for an answer. Maybe I am just missing something, or is data.table simply limited in this way?

I am aware only of this question from 2013, which seems to suggest data.table could not do this back then.

This is the code that causes the error:

  pfill = q[, k := t + 1][q2[, k := tprm], on = .(k), nomatch = 0L, allow.cartesian = TRUE][, k := NULL]
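
For reference, here is a minimal sketch of the same join pattern on toy tables (hypothetical column values, not my real data); the last line also shows that by=.EACHI is only accepted together with a j expression:

  library(data.table)

  # Toy stand-ins for q and q2, small enough to run
  q  <- data.table(t = rep(1:3, each = 2), x = rnorm(6))
  q2 <- data.table(tprm = rep(2:4, 100), y = rnorm(300))

  # Same shape as the failing join: each row of q2 matches two rows of q,
  # so the many-to-many result needs allow.cartesian = TRUE
  pfill_toy <- q[, k := t + 1][q2[, k := tprm], on = .(k), nomatch = 0L,
                               allow.cartesian = TRUE][, k := NULL]

  # by = .EACHI requires a j expression; here j = .N counts the matches
  # per row of q2 instead of materialising the full join result
  q[q2, on = .(k), .N, by = .EACHI]
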
wolfsatthedoor
  • did you pass in `allow.cartesian=TRUE`? Can you show your code that causes this error? – chinsoon12 Apr 08 '20 at 00:04
  • hi @wolfsatthedoor, do you really need all the rows from the join? Or can you add in one more joining key? It is probably a many-to-many join causing the huge allocation required. I think there are some discussions on this in github/Rdatatable you might want to check out – chinsoon12 Apr 08 '20 at 00:13
  • @chinsoon12 I really do need all the rows, unfortunately. Is data.table just stumped by any data table with more than 2 billion rows? – wolfsatthedoor Apr 08 '20 at 00:32
  • You might need to search the github for discussions as I don’t have access right now – chinsoon12 Apr 08 '20 at 00:38
  • see https://github.com/Rdatatable/data.table/issues/3957 – jangorecki Apr 08 '20 at 18:55

2 Answers


As data.table still seems to be limited to 2^31 rows, you could, as a workaround, use arrow combined with dplyr to overcome this limit:

library(arrow)
library(dplyr)

# Create three feather files of 2^30 rows each (3 * 2^30 rows in total)
dir.create("test", showWarnings = FALSE)
dt <- data.frame(val = rep(1.0, 2^30))
write_feather(dt, "test/data1.feather")
write_feather(dt, "test/data2.feather")
write_feather(dt, "test/data3.feather")

# Open the 3 files as a single dataset
dset <- open_dataset("test", format = "feather")

# Get number of rows
(nrows <- dset %>% summarize(n=n()) %>% collect() %>% pull)
#integer64
#[1] 3221225472

# Check that we're above 2^31 rows
nrows / 2^31
#[1] 1.5
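
As a sketch of how such a dataset can then be queried (using the dset object created above): the dplyr verbs are evaluated by arrow, so only the rows you collect() need to fit into an in-memory data.frame:

# Lazily filter the > 2^31-row dataset; only the collected subset
# is materialised as an R data.frame
dset %>%
  filter(val > 0.5) %>%
  head(10) %>%
  collect()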
Waldi
  • Pretty nice workaround, first time I've come across the `arrow` package, super! +1! – ThomasIsCoding Jun 24 '22 at 05:33
  • @ThomasIsCoding, came to `arrow` from [disk.frame](https://github.com/DiskFrame/disk.frame) : {disk.frame} has been soft-deprecated in favor of {arrow} – Waldi Jun 24 '22 at 06:57
  • The answer creates _an object_ with more than 2^31 rows, but that _object_ is not a `data.frame` or `data.table`. As such, it appears to more or less miss the question. – Dirk Eddelbuettel Jun 24 '22 at 21:20
  • @DirkEddelbuettel, this is an answer to the OP's request for a workaround, allowing data manipulation for a large number of rows. It might fill a gap between the `data.table` row limit and big-data manipulation like [sparklyr](https://spark.rstudio.com/) – Waldi Jun 25 '22 at 07:09

If you just want to know yes or no: I guess we cannot have a data.table object with more than 2^31 rows if we stay within data.table only. However, if you are willing to step outside data.table, the answer by @Waldi is a fabulous workaround for this issue.

The explanation below is just an example to somewhat "prove" the infeasibility, which hopefully provides some hints.


Let's think about it the other way around. Assuming we have a data.table dt with more than 2^31 rows, what happens when we index its rows? Note that rows are indexed with integers, which means we would need integers larger than 2^31 in your case. Unfortunately, if you type ?.Machine in the console, you will see that

The algorithm is based on Cody's (1988) subroutine MACHAR. As all current implementations of R use 32-bit integers and use IEC 60559 floating-point (double precision) arithmetic, the "integer" and "double" related values are the same for almost all R builds.

and

integer.max the largest integer which can be represented. Always 2^31 - 1 = 2147483647.

If the assumption were true, we would run into indexing issues, i.e., row indices that cannot be represented as integers. Thus the assumption does not hold.
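
As a quick illustration of that limit (plain base R, nothing specific to data.table):

> .Machine$integer.max
[1] 2147483647
> .Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow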


A Simple Test

Given a long vector v of length 2^31 (which is larger than 2^31-1), let's see what will happen if we use it to initialize a data.table:

> v <- seq_len(2^31)

> d <- data.table(v)
Error in vapply(X = x, FUN = fun, ..., FUN.VALUE = NA_integer_, USE.NAMES = use.names) : 
  values must be type 'integer',
 but FUN(X[[1]]) result is type 'double'

As we can see, there is no issue when creating a vector of length 2^31, but we run into trouble when initializing the data.table d. When we look into the source code of data.table, we see there are several places using length, which only returns an integer when the vector has no more than 2^31 - 1 elements.

The default method for length currently returns a non-negative integer of length 1, except for vectors of more than 2^31 - 1 elements, when it returns a double.

and we can check that

> class(length(v))
[1] "numeric"

which means the output is not an integer, as data.table requires.
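
For comparison, a small check (in recent R versions both calls create compact ALTREP sequences, so no large allocation is needed):

> class(length(seq_len(2^31 - 1)))  # the count still fits in an integer
[1] "integer"
> class(length(seq_len(2^31)))      # one element more and length() returns a double
[1] "numeric"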

ThomasIsCoding
  • AFAIK `data.table` is written in C++. Do R integer indexing limitations apply to C++? – Waldi Jun 22 '22 at 08:54
  • @Waldi When looking into the source code of `data.table` (just type `data.table` in the console), we will see that there are some functions like `length` and `nrow`, which might be the limitations for indexing, for example. – ThomasIsCoding Jun 22 '22 at 09:12
  • Agree, also relevant: `data.table` is a `data.frame` https://stackoverflow.com/questions/5233769/practical-limits-of-r-data-frame – Waldi Jun 22 '22 at 12:15
  • @Waldi thanks, it's a really nice reference – ThomasIsCoding Jun 22 '22 at 13:04
  • What is a workaround then? – wolfsatthedoor Jun 23 '22 at 03:11
  • @wolfsatthedoor I don’t think there is a workaround – ThomasIsCoding Jun 23 '22 at 04:24
  • I'm having trouble understanding the argumentation in your answer. R doesn't require an index to be of type integer: `x <- seq_len(2^32); x[2^32]` (see also: https://github.com/hadley/r-internals/blob/master/vectors.md). Why would data.table need to require it? We can have larger data.tables. The constraint is developer resources. – Roland Jun 24 '22 at 06:52
  • @Roland Thanks for your feedback. The issue comes from the mechanism of building a data.table object (see the example in my update). The vector `v` is fine, but we cannot use it to create a data.table if its length is more than `2^31-1`. – ThomasIsCoding Jun 24 '22 at 21:16
  • That's nothing that can't be fixed. It's not a fundamental limitation of R. – Roland Jun 26 '22 at 13:43
  • @Roland Yes, I agree. It can be fixed, but it might be a limitation of `data.table` or `data.frame` here. – ThomasIsCoding Jun 26 '22 at 19:06