
I just tried merging two tables in R 3.0.1 on a machine with 64 GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.)

Here is what my code looks like:

library(parallel)
library(data.table)

# data1: several million rows, 3 columns: tag, prod and v. There are
# 750K unique values of tag, anywhere from 1 to 1000 prods per tag,
# and 5000 possible values of prod. v takes any positive real value.

setkey(data1, tag)
merge(data1, data1, allow.cartesian = TRUE)

I get the following error:

Error in vecseq(f_, len_, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice. Calls: merge -> merge.data.table -> [ -> [.data.table -> vecseq
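
For what it's worth, the expected size of the join can be estimated up front: a cartesian self join on tag produces sum(n_i^2) rows, where n_i is the number of rows for tag i. A quick check on data1 as described above:

# Expected row count of the self join on tag: the sum over tags of the
# squared group size. If this exceeds 2^31 - 1, the vecseq error is expected.
data1[, .N, by = tag][, sum(as.numeric(N)^2)]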

New example showing by-without-by:

country = fread("
country product share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod = fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")

It seems entirely impossible to select the subset of markets that share a country tag, find the covariances within those pairs, and collate those by country without running up against the size limit. Here is my best shot so far:

setkey(country,country)
setkey(prod, prod, period)
covars <- setkey(
    setkey(
        unique(country[country, allow.cartesian = TRUE][, c("prod", "prod.1"), with = FALSE]),
        prod
    )[prod, allow.cartesian = TRUE],
    prod.1, period
)[prod][, list(pcov = cov(value, value.1)), by = list(prod, prod.1)]
# Finds unique market pairs from the self-join, merges them with the
# second table, and calculates covariances from the merged table.
clevel <- setkey(country[country, allow.cartesian = TRUE], prod, prod.1)[covars, nomatch = 0][
    , list(countryvar = sum(share * share.1 * pcov)), by = "country"]
> clevel
   country countryvar
1:       1   2.858667
2:       2   1.869667

When I try this approach on any reasonable size of data, I run up against the vecseq error. It would be really nice if data.table did not balk so much at the 2^31 limit. I am a fan of the package. Suggestions on how I can use more of the j specification would also be appreciated. (I am not sure how else to use the j specification, given that I have to compute covariances from the intersection of the two data tables.)

user2627717
  • I think Matthew wrote this error message after I asked [this question](http://stackoverflow.com/q/11610562/1412059). I think the message explains well what happens. – Roland Aug 07 '13 at 11:42
  • According to the R 'Memory-limits' documentation, it isn't possible to allocate a vector of length longer than 2^31-1. Tell us what join you wanna do. You should never use a cartesian join... – statquant Aug 07 '13 at 12:49
  • @Roland: Thanks but I am not sure it is the same error. It is true that my keys are not unique, but the join is not misspecified. My goal is pairing up the prod values for each tag. – user2627717 Aug 07 '13 at 12:49
  • @statquant: True for older versions of R. I thought R 3.0 fixed that. It claims to. – user2627717 Aug 07 '13 at 12:51
  • You are using a 64-bit version of R >= 3.0.0... right, it is supposed to work, so I guess the message is a bit old, but the fact is that this amount of data is unlikely to fit in RAM... – statquant Aug 07 '13 at 12:52
  • You claim to have 750K tags and let's say 500 values per tag. That means you'll end up with 750k*500^2 rows. Do you really believe that's a sensible merge? – Roland Aug 07 '13 at 12:54
  • I know it is an absurdly large merge... but I would really like to keep it that way if possible. @Roland - off the cuff, I expect around 250e6 rows when the merge is done. – user2627717 Aug 07 '13 at 13:03
  • @statquant how do you figure that? 48 bytes per double, means 24*3*48 = 3456 million bytes required or 3.5 GB. There is plenty of RAM in the system for that. – Simon O'Hanlon Aug 07 '13 at 13:04
  • @SimonO101: After the merge, I should have 5 columns, tag prod1 prod2 v1 v2. Combined with 250e6 observations, I get 250e6*5*48 = 60GB. I thought having 64G of RAM would take care of that. – user2627717 Aug 07 '13 at 13:15
  • @user2627717 OS overhead, other background processes, also data.table requires *up to* one column's worth of working memory to perform these operations if I remember correctly so you are going to be a bit shy of what would be required I think – Simon O'Hanlon Aug 07 '13 at 13:33
  • @user2627717 and you are lucky to work with data.table, should that be a data.frame you would need at least 3x the memory the resulting table would take. Really usually when there is a cross join somewhere one can work out a smarter way. – statquant Aug 07 '13 at 14:09
  • Thanks. I just jacked up the memory, still no dice. Let me try recompiling the data.table package... unless someone has another suggestion. – user2627717 Aug 07 '13 at 14:13
  • If you're going to do this kind of stuff a lot, you may want to consider using an actual database, rather than faking it with R. – Hong Ooi Aug 14 '13 at 06:36
  • I am still getting this error in 2020; has data.table not updated? – wolfsatthedoor Apr 07 '20 at 07:23

2 Answers


R 3.0.1 supports objects having lengths greater than 2^31 - 1. While the packages that come with base R can already create such objects, whether contributed packages can do the same depends on the package. Basically, any package using compiled code would have to be recompiled and possibly modified to take advantage of this feature.
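
For illustration, base R on a 64-bit build can already create such a "long vector", given enough free memory (this allocation alone needs about 16 GB):

# Length above 2^31 - 1 is allowed since R 3.0.0; needs ~16 GB of RAM
x <- numeric(2^31)
length(x)   # 2147483648, returned as a double since it exceeds integer range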

Also, assuming that 64 GB of RAM is enough to work with 60 GB objects is kind of optimistic.

Hong Ooi
  • +1 `data.table` needs to be tweaked to use R's new long vector features, which would be great. But for this question it's very likely the cartesian join isn't needed, iiuc. – Matt Dowle Aug 07 '13 at 14:19
  • Should I report this as a bug/feature request, or would that be unnecessary? – user2627717 Aug 07 '13 at 14:46
  • @user2627717 Unnecessary. I still think this problem can be done without needing large vectors, by following the suggestions in the error message. – Matt Dowle Aug 08 '13 at 05:03
  • please see updated question with example. The paired table at the end of the example is why I need the self-join. – user2627717 Aug 10 '13 at 01:12
  • @user2627717 was this ever resolved? – wolfsatthedoor Jun 21 '22 at 13:38
  • @MattDowle was this ever resolved? – wolfsatthedoor Jun 21 '22 at 17:50
  • @wolfsatthedoor it seems I investigated at the time and wrote a detailed answer above showing that the join appeared to be misspecified. So I'm not sure what there is for me to resolve. – Matt Dowle Jun 23 '22 at 04:08
  • @MattDowle I meant with the 2^31 error from data table. I have a properly specified join that hits this limit (but not memory limit). See for example: https://stackoverflow.com/questions/61091313/are-data-tables-with-more-than-231-rows-supported-in-r-with-the-data-table-pack and https://stackoverflow.com/questions/72707188/data-table-conditional-cross-join-with-two-conditions/72711558#72711558 – wolfsatthedoor Jun 23 '22 at 15:12

This join indeed seems to be misspecified. In general, I think, a self join of a table with a single-column key is probably always misspecified. Consider the following example:

> DT
   A B
1: 1 5
2: 1 6
3: 1 7
4: 2 8
5: 2 9
> setkey(DT,A)

There are 2 unique values of A (1 and 2), but they are repeated in the A column. The key is a single column.
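
(For reference, the DT above can be constructed like this:)

library(data.table)
DT <- data.table(A = c(1, 1, 1, 2, 2), B = 5:9)  # values as printed above
setkey(DT, A)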

> DT[DT]   # the long error message

> DT[DT, allow.cartesian=TRUE]  # **each row** of DT is self joined to DT
    A B B.1
 1: 1 5   5
 2: 1 6   5
 3: 1 7   5
 4: 1 5   6
 5: 1 6   6
 6: 1 7   6
 7: 1 5   7
 8: 1 6   7
 9: 1 7   7
10: 2 8   8
11: 2 9   8
12: 2 8   9
13: 2 9   9

Is this really the result you need? More likely, the query needs to be changed: add more columns to the key, use by instead, avoid the self join, or follow the hints in the error message.
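
For instance, adding B to the key makes this particular self join one-to-one, because each (A,B) pair occurs exactly once:

setkey(DT, A, B)   # key now identifies each row uniquely
DT[DT]             # 5 rows: every row matches only itself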

More information about what you need to achieve after the merge (bigger picture) is likely to help.

Example of "including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation" (see the error message in the question):

The example now in the question (covariance) is normally done with matrices rather than data.table. But proceeding anyway to demonstrate quickly, using it just as example data ...

require(data.table)
country = fread("
Country Product
1 5
1 6
1 7
2 6
2 7
2 8
2 9
")
prod = fread("
Prod1 Prod2 Covariance
5 5 .4
5 6 .5
5 7 .6
5 8 -.3
5 9 -.1
6 6 .3
6 7 .2
6 8 .4
6 9 -.2
7 7 .2
7 8 .1
7 9 .3
8 8 .1
8 9 .6
9 9 .01
")


country
   Country Product
1:       1       5
2:       1       6
3:       1       7
4:       2       6
5:       2       7
6:       2       8
7:       2       9
prod
    Prod1 Prod2 Covariance
 1:     5     5       0.40
 2:     5     6       0.50
 3:     5     7       0.60
 4:     5     8      -0.30
 5:     5     9      -0.10
 6:     6     6       0.30
 7:     6     7       0.20
 8:     6     8       0.40
 9:     6     9      -0.20
10:     7     7       0.20
11:     7     8       0.10
12:     7     9       0.30
13:     8     8       0.10
14:     8     9       0.60
15:     9     9       0.01


setkey(country,Country)
country[country,{print(.SD);print(i.Product)}]
# included j to demonstrate j running for each row of i. Just printing to demo.
   Product
1:       5
2:       6
3:       7
[1] 5
   Product
1:       5
2:       6
3:       7
[1] 6
   Product
1:       5
2:       6
3:       7
[1] 7
   Product
1:       6
2:       7
3:       8
4:       9
[1] 6
   Product
1:       6
2:       7
3:       8
4:       9
[1] 7
   Product
1:       6
2:       7
3:       8
4:       9
[1] 8
   Product
1:       6
2:       7
3:       8
4:       9
[1] 9
Empty data.table (0 rows) of 2 cols: Country,Product


setkey(prod,Prod1,Prod2)
country[country,prod[J(i.Product,Product),Covariance,nomatch=0]]
    Country Prod1 Prod2 Covariance
 1:       1     5     5       0.40
 2:       1     5     6       0.50
 3:       1     5     7       0.60
 4:       1     6     6       0.30
 5:       1     6     7       0.20
 6:       1     7     7       0.20
 7:       2     6     6       0.30
 8:       2     6     7       0.20
 9:       2     6     8       0.40
10:       2     6     9      -0.20
11:       2     7     7       0.20
12:       2     7     8       0.10
13:       2     7     9       0.30
14:       2     8     8       0.10
15:       2     8     9       0.60
16:       2     9     9       0.01

country[country, prod[J(i.Product,Product),Covariance,nomatch=0]][
    ,mean(Covariance),by=Country]
   Country        V1
1:       1 0.3666667
2:       2 0.2010000

This doesn't match the desired result because the off-diagonal entries are not doubled. But hopefully this is enough to demonstrate that particular suggestion from the error message in the question, and you can take it from here. Or use matrix rather than data.table for covariance-type work, as sketched below.
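
For completeness, here is a sketch of that matrix route using the fread data from the question; it reproduces the clevel result with no cartesian join at all (dcast on a data.table needs a recent version of the package, or reshape2 on older ones):

# Reshape prod (prod, period, value) to a period x product matrix;
# cov() then gives every product pair at once, with no 2^31 blow-up.
w <- dcast(prod, period ~ prod, value.var = "value")
V <- cov(as.matrix(w)[, -1])            # 5 x 5, dimnames "5".."9"

# Country-level aggregate: sum_ij share_i * share_j * cov_ij = s'Vs
country[, {
    s <- share
    v <- V[as.character(product), as.character(product)]
    list(countryvar = as.numeric(t(s) %*% v %*% s))
}, by = country]
#    country countryvar
# 1:       1   2.858667
# 2:       2   1.869667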

Matt Dowle
  • This is exactly the result I need. I am trying to study interactions between prods that share a tag. This is what leads me to the error message above. Your help would be appreciated. – user2627717 Aug 07 '13 at 14:30
  • B and B.1 in your example give me pairs of those prods, for which i can estimate correlations etc. – user2627717 Aug 07 '13 at 14:32
  • So it looks like we are back to tweaking data.table. – user2627717 Aug 07 '13 at 14:45
  • @user2627717 Did you try "try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation" in the error message? I doubt we need large vectors for this. – Matt Dowle Aug 08 '13 at 04:58
  • I tried creating a copy of the table, and using j this way: tablecopy <- tablecopy[data1, list(prod1=data$prod,prod2=tablecopy$prod),allow.cartesian=T]. I got a segfault "memory not mapped" error. – user2627717 Aug 08 '13 at 08:06
  • Okay, I must admit I blundered that data copy (assigned instead of copied). When I finally did the copy and ran as described above, I got back to the vecseq error. I really need this self-join if possible. – user2627717 Aug 08 '13 at 08:42
  • @user2627717 Can you please show a small example showing the input and output; i.e., show what "study interactions between prods that share a tag" means. Then I can show the syntax to get this done. – Matt Dowle Aug 08 '13 at 19:53
  • Let's start with your example. Say A is a country, and B are products the country exports. Let's add another variable C that is the value of those exports. Then the expanded table tells me two things, which products are co-produced by the same country, and the correlation between C and C.1 tell me whether exports of B and B.1 balance each other out, or worsen the variability of total exports. Does this help clarify what I'd like to do? – user2627717 Aug 09 '13 at 00:27
  • @user2627717 That doesn't help at all, I'm afraid. It's English; it isn't a small example showing input and output. – Matt Dowle Aug 09 '13 at 08:45
  • Try this: country 1 produces goods 5to7, country2 produces goods 6to9. There is a variance covariance table that shows how demand for each good relates to others, say: 5-5 .4, 5-6 .5, 5-7 .6, 5-8 -.3, 5-9 -.1, 6-6 .3, 6-7 .2, 6-8 .4, 6-9 -.2, 7-7 .2, 7-8 .1, 7-9 .3, 8-8 .1, 8-9 .6, 9-9 .01. I want a table that shows country 1 and 2's aggregate covariance numbers. That covariance must be estimated over all pairs sold by the country. Hence my need for that really large table. In this case country 1's covariance would be avg(.4 + .3 +.2 + 2*(.5+.6+.2)). – user2627717 Aug 09 '13 at 14:18
  • I would attach a larger table, but not sure if it would add more clarity. Size is built into the nature of this problem - so for country 1, what I need is not just the list 5-7, but 5-5, 6-6, 7-7 and all the interactions between the three goods – user2627717 Aug 09 '13 at 14:20
  • @user2627717 At this point I give up. Please take the time to look at other people's questions and hopefully that will help. – Matt Dowle Aug 09 '13 at 21:32
  • I just updated the question with a clear example like you requested. I hope I didn't tax your patience too much. Thanks. – user2627717 Aug 10 '13 at 01:13
  • @user2627717 Perfect - much better, thanks. Now I see. Didn't before. +1 Will answer later ... – Matt Dowle Aug 10 '13 at 08:36
  • Great. Thanks. Will keep my fingers crossed. – user2627717 Aug 11 '13 at 17:39
  • @user2627717 Edit added. Hope it makes some sense. – Matt Dowle Aug 12 '13 at 10:58
  • Thanks! It looks great. I'll poke around it a bit to see if I can use it to create the covariance table - which is the first step. This is much better than where I started though. – user2627717 Aug 13 '13 at 18:53
  • So I tried it with the more complex version of the problem I presented (see edit to question). The size issue cropped up again... – user2627717 Aug 13 '13 at 19:31
  • @user2627717. Ok. Happy to help. But are you expecting me to guess the command you wrote? – Matt Dowle Aug 13 '13 at 22:32
  • It seems that the compound query still runs up against the vector size limit in data.table for large datasets. – user2627717 Aug 14 '13 at 15:35
  • (Though it is quite possible that I did not code it quite right). – user2627717 Aug 14 '13 at 15:41
  • @user2627717 Thanks. Looking at your code in the new edit, it seems there are still joins where `j` is missing. To specify `j` is one of the hints in the long error message and is what I illustrated further in my answer. – Matt Dowle Aug 16 '13 at 13:08
  • @user2627717 I've looked. I don't see why you don't see why `country[country, allow.cartesian=T]` is the problem. There's no `j` there. I've tried to explain in my edits to my answer above, and reading again now I think they're useful. The one thing I can suggest is you post to datatable-help (link on homepage) with a link to this question. Perhaps someone else can help grease the communication cogs here. – Matt Dowle Aug 28 '13 at 00:24
  • I agree, the communication may need help. I should have clearly explained that your example assumed that some 'prod' table with covariances already exists. I have found no way to create that table without a cartesian merge. (The covariance table would be really large btw- prob with nrow > 2^31). How should I explain that the very useful example you provided still misses out the most critical component of the new example I laid out? – user2627717 Aug 28 '13 at 09:12
  • What I am really asking for is a hint on how to make the prod table in your example without hitting the vecseq limit. I am grateful for your help. – user2627717 Aug 28 '13 at 09:23
  • @user2627717 If covariance is the real task then why not use `matrix`? I'm not sure `data.table` is the right tool for you. `matrix` was restricted to nrow*ncol < 2^31 too but R has recently introduced long vectors to relax that limit, so perhaps you need that new feature. – Matt Dowle Aug 28 '13 at 10:03