Join with non-unique keys, unique in i

Question

I have a data table with non-unique keys:

> dput(sv)
structure(list(kwd = c("a", "a", "b", "b", "c"), pixel = c(1,
2, 1, 2, 2), kpN = c(2L, 2L, 2L, 1L, 1L)), row.names = c(NA,
-5L), class = c("data.table", "data.frame"), .Names = c("kwd",
"pixel", "kpN"), .internal.selfref = <pointer: 0x7fc4aa800778>, sorted = "kwd")
> dput(kwd)
structure(list(kwd = c("a", "b", "c", "z"), kwdN = c(3L, 2L,
1L, 1L)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
), .Names = c("kwd", "kwdN"), .internal.selfref = <pointer: 0x7fc4aa800778>, sorted = "kwd")

Why am I getting this error?

> sv[kwd,kwdN:=kwdN]
Starting bmerge ...done in 0 secs
Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  :
  Join results in 6 rows; more than 5 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Calls: [ -> [.data.table -> vecseq

I expected something like this (note that the keys are unique in kwd):

   kwd pixel kpN kwdN
1:   a     1   2    3
2:   a     2   2    3
3:   b     1   2    2
4:   b     2   1    2
5:   c     2   1    1

Moreover, I am pretty sure that it worked like that before.

Is this something that changed in data.table 1.9.4?

How do I get what I want? (kwd[sv] appears to work, is that the new way?)

`allow.cartesian` error shouldn't popup here. This has been fixed in 1.9.5. Check point 8 under bug fixes for 1.9.5 [here](https://github.com/Rdatatable/data.table/blob/master/README.md). When `i` has duplicates, then as the error message already says, you should use `allow.cartesian=TRUE`. — Arun, Oct 29 '14 at 16:03
Not sure what you're trying to say. I agreed this shouldn't happen, and showed you the link that the issue has been fixed since. — Arun, Oct 29 '14 at 16:31
@Arun: I thought you meant that 1.9.5 was already released. I see that this is not the case, it's a development version. When is 1.9.6 expected? Thanks. — sds, Oct 29 '14 at 16:35

score 1 · Answer 1 · edited May 23 '17 at 12:05

Just so this remains answered:

allow.cartesian functionality was implemented after this post from @Roland. Also refer to this post for additional explanation.

Cases where allow.cartesian is not necessary (and therefore should not error) are:

When i has no duplicates, #742 - this was not checked correctly before. Fixed in 1.9.5 (current development version).
When j has :=, #800 - the number of rows will never exceed nrow(x). Fixed in 1.9.5 (current development version).
When the operation is a not-join (or anti-join), #698 - once again, the number of rows will never exceed nrow(x). Fixed in 1.9.4.
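For contrast, here is a minimal sketch of a join where the check is doing its job, followed by a not-join on the same data. The tables dx and di and their columns are hypothetical toy data, not taken from the question, and the behaviour described assumes a data.table version with the fixes above (1.9.6 or later).

```r
library(data.table)

# Hypothetical toy tables: every row of dx matches every row of di.
dx <- data.table(k = rep("a", 3), v = 1:3, key = "k")
di <- data.table(k = rep("a", 3), w = 1:3, key = "k")

# 3 x 3 = 9 result rows from duplicated keys in i: the case the check targets.
# tryCatch() captures the error message, if one is raised, instead of stopping.
msg <- tryCatch({ dx[di]; NA_character_ },
                error = function(e) conditionMessage(e))
print(msg)

# Opting in explicitly returns all 9 matched pairs.
big <- dx[di, allow.cartesian = TRUE]
print(nrow(big))

# Not-join: rows of dx with no match in di. It can only drop rows from dx,
# so allow.cartesian is irrelevant here.
none <- dx[!di]
print(nrow(none))
```

Here the duplicated keys in di are what make the result larger than either input, which is why the explicit opt-in is asked for.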

In summary, the allow.cartesian error now occurs only where it is necessary. The fixes made in 1.9.5 will become available when 1.9.6 is released on CRAN (should be very soon now).
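As a sketch of the update join from the question, assuming data.table 1.9.6 or later is installed: the two tables are rebuilt by hand from the dput output, and the i. prefix is used only to make explicit that kwdN comes from kwd.

```r
library(data.table)

# Rebuild the two tables from the question (values copied from the dput output).
sv <- data.table(kwd   = c("a", "a", "b", "b", "c"),
                 pixel = c(1, 2, 1, 2, 2),
                 kpN   = c(2L, 2L, 2L, 1L, 1L),
                 key   = "kwd")
kwd <- data.table(kwd  = c("a", "b", "c", "z"),
                  kwdN = c(3L, 2L, 1L, 1L),
                  key  = "kwd")

# Update join: add kwdN to sv by reference, looked up from kwd on the key.
# Because j uses :=, sv keeps its 5 rows; the unmatched "z" row is ignored.
sv[kwd, kwdN := i.kwdN]
print(sv)
```

The kwd[sv] form mentioned in the question also performs the lookup, but it returns a new table instead of modifying sv in place.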
