I have two data.tables:
- train.o with 160057 rows; the column id is unique
- train.t with 187105085 rows; the column id is not unique, but there are 160057 unique ids, all of which appear in train.o (see the checks sketched below)
I want to merge them into a single data.table with 187105085 rows.
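These properties can be verified with checks along the following lines (a sketch, using only base R on the two tables):

length(unique(train.o$id)) == nrow(train.o)   # TRUE: id is unique in train.o
length(unique(train.t$id))                    # 160057 distinct ids in train.t
all(unique(train.t$id) %in% train.o$id)       # TRUE: every train.t id occurs in train.o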
When I type
> setkey(train.o,id)
> setkey(train.t,id)
> trainbig=merge(train.t,train.o)
I get
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
What am I doing wrong? The result I expect has 187105085 rows, which is well below 2^31 = 2147483648, so I don't understand why the join hits that limit.
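For reference, since both tables are keyed on id, my understanding is that the merge above corresponds to the keyed X[Y] join below (merge.data.table is built on this form, so it presumably fails the same way):

trainbig = train.o[train.t]   # look up each train.t row's id in train.o; expected nrow: 187105085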
EDIT:
> head(train.o)
     offer       id chain market repeattrips repeater  offerdate category quantity   company
1: 1197502    86252   205     34          16        t 2013-03-27     3203        1 106414464
2: 1208251    86246   205     34           5        t 2013-04-24     2202        1 104460040
3: 1197502 12682470    18     11           0        f 2013-03-28     3203        1 106414464
4: 1197502 12996040    15      9           0        f 2013-03-25     3203        1 106414464
5: 1204821 13089312    15      9           0        f 2013-04-01     5619        1 107717272
6: 1197502 13179265    14      8           0        f 2013-03-29     3203        1 106414464
   offervalue  brand
1:       0.75  13474
2:       2.00   3718
3:       0.75  13474
4:       0.75  13474
5:       1.50 102504
6:       0.75  13474
> head(train.t)
     id chain dept category    company brand       date productsize productmeasure
1 86246   205    7      707 1078778070 12564 2012-03-02          12             OZ
2 86246   205   63     6319  107654575 17876 2012-03-02          64             OZ
3 86246   205   97     9753 1022027929     0 2012-03-02           1             CT
4 86246   205   25     2509  107996777 31373 2012-03-02          16             OZ
5 86246   205   55     5555  107684070 32094 2012-03-02          16             OZ
6 86246   205   97     9753 1021015020     0 2012-03-02           1             CT
  purchasequantity purchaseamount
1                1           7.59
2                1           1.59
3                1           5.99
4                1           1.99
5                2          10.38
6                1           7.80
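Following the error's advice to check for duplicate key values in i, I suppose a diagnostic along these lines would show the per-id group sizes in each table (a sketch):

train.t[, .N, by = id][order(-N)][1:5]   # the five most frequent ids in train.t
train.o[, .N, by = id][N > 1L]           # should return an empty data.table if id is unique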