2

I have 2 data.tables:

  • train.o with 160057 rows, column id is unique
  • train.t with 187105085 rows, column id is not unique, there are 160057 unique id, all in train.o

I want to merge it, to have a single data.table with 187105085 rows.

When I type

> setkey(train.o,id)
> setkey(train.t,id)
> trainbig=merge(train.t,train.o)

I get

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

What do I wrong?

EDIT:

> head(train.o)
 offer       id chain market repeattrips repeater  offerdate category quantity   company
1: 1197502    86252   205     34          16        t 2013-03-27     3203        1 106414464
2: 1208251    86246   205     34           5        t 2013-04-24     2202        1 104460040
3: 1197502 12682470    18     11           0        f 2013-03-28     3203        1 106414464
4: 1197502 12996040    15      9           0        f 2013-03-25     3203        1 106414464
5: 1204821 13089312    15      9           0        f 2013-04-01     5619        1 107717272
6: 1197502 13179265    14      8           0        f 2013-03-29     3203        1 106414464
   offervalue  brand
1:       0.75  13474
2:       2.00   3718
3:       0.75  13474
4:       0.75  13474
5:       1.50 102504
6:       0.75  13474

> head(train.t)
     id chain dept category    company brand       date productsize productmeasure
1 86246   205    7      707 1078778070 12564 2012-03-02          12             OZ
2 86246   205   63     6319  107654575 17876 2012-03-02          64             OZ
3 86246   205   97     9753 1022027929     0 2012-03-02           1             CT
4 86246   205   25     2509  107996777 31373 2012-03-02          16             OZ
5 86246   205   55     5555  107684070 32094 2012-03-02          16             OZ
6 86246   205   97     9753 1021015020     0 2012-03-02           1             CT
  purchasequantity purchaseamount
1                1           7.59
2                1           1.59
3                1           5.99
4                1           1.99
5                2          10.38
6                1           7.80
Adam Szałucha
  • 140
  • 3
  • 11
  • At least show us the structure, column names of your data sets and also show us that there are no duplicated elements on the column you're joining. – Arun May 28 '14 at 15:15
  • The data come from [link]http://www.kaggle.com/c/acquire-valued-shoppers-challenge/data -train.o is merge(trainHistory,offers,by="offer") -train.t is subset(transactions,id %in% trainHistory$id) The transactions file can be joined to the history file by (id,chain). But setkey for both id and chain doesn't help. – Adam Szałucha May 28 '14 at 15:39
  • I've also edited post to include heads of both data.tables – Adam Szałucha May 28 '14 at 16:12
  • what does `sum(!duplicated(train.o$id))` and `sum(!duplicated(train.t$id))` give? – Arun May 28 '14 at 16:27
  • Both 160057, I wrote it in the beginning. – Adam Szałucha May 28 '14 at 16:59
  • Okay one last test before I give up. What does `sum(is.na(match(unique_values_from_train.o, unique_values_train.t)))` give? Just to verify if the keys are identical in both. I'm sorry I don't have the time to download GBs of data from Kaggle. The objective should be to produce a minimal example, but that seems to be not the case here. So, this'll be my last attempt. – Arun May 28 '14 at 17:20
  • It seems as though you are just running up against the language limits on the integer class. – IRTFM May 28 '14 at 22:07
  • If you mean unique values of id in both data.tables it gives 0. I'm wondering if it is not a bug in data.table. When I merge in the same way offers and trainHistory to get train.o it works. – Adam Szałucha May 29 '14 at 07:53
  • Especially that plyr's join(train.t,train.o,by="id") succeed. – Adam Szałucha May 29 '14 at 10:47
  • Could you tell what's the class of `id` in all these data.tables? – Arun Jun 02 '14 at 12:48
  • Could you please try again from the current development version from [github](https://github.com/Rdatatable/data.table)? – Arun Sep 10 '14 at 18:39

1 Answers1

-1

Have you tried an unelegant (but effective):

trainbig=merge(train.t,train.o, by=.EACHI, all=T)

and then manually removing the variables you are not interested in?

Nathan Tuggy
  • 2,237
  • 27
  • 30
  • 38
Ileeo
  • 25
  • 1
  • 7