More effective merging of matched column with duplicates in data.table

Question

I've got two data.tables, both of which share one variable; I'm trying to add a variable that's missing from the second, but which is tied one-for-one to the shared variable.

This is clearly a merge, but because the shared variable has multiple instances, I'm having to use what feels like a workaround to merge the new variable.

Let's get specific.

x <- data.table(let = rep(letters[1:3], 2:4),
                num = rep(1:3, 2:4), other = rnorm(9))
y <- data.table(let = rep(c("a", "c"), c(10, 6)))

x:
   let num       other
1:   a   1 -0.41695882
2:   a   1 -0.59875888
3:   b   2 -0.19433915
4:   b   2  0.58406046
5:   b   2 -0.33922321
6:   c   3 -0.63076561
7:   c   3  1.06987710
8:   c   3  0.08869372
9:   c   3 -1.31196123

y:
    let
 1:   a
 2:   a
 3:   a
 4:   a
 5:   a
 6:   a
 7:   a
 8:   a
 9:   a
10:   a
11:   c
12:   c
13:   c
14:   c
15:   c
16:   c

I just want to add the num column to y; since num is matched 1-1 with let, it doesn't really matter that there's duplicates.

Here's an approach that works; I've just got a feeling there's something simpler.

setkey(x, let)
setkey(y, let)

y <- x[!duplicated(let), c("let", "num"), with = FALSE][y]

score 7 · Accepted Answer · edited Oct 19 '21 at 10:27

7

The only improvements that I could think of is that

You could skip the setkey(x, let) part
You could also update y by reference (rather than creating a copy using <- and then assigning back to y)

If you are using the current stable version version of data.table (v <= 1.9.4) you will have to use allow.cartesian = TRUE

setkey(y,let)
y[x[!duplicated(let)], num := i.num, allow.cartesian = TRUE][]

You could alternatively use unique instead of duplicated (they both have data.table methods)

y[unique(x, by = "let"), num := i.num, allow.cartesian = TRUE]

Here's another possibility using the new .EACHI method, although there's no need for the use of by=.EACHI here. I've shown you just to expose this feature for you. Have a look at this post for a detailed explanation of what this does and when it's useful.

y[x, num := unique(i.num), by = .EACHI, allow.cartesian = TRUE]

Edit: (Thanks to @Arun for pointing this out)

We shouldn't need allow.cartesian argument here, as there are no duplicates in i. In fact, it's a bug, #742 that has been fixed in the current development version (1.9.5). So you just need to do:

y[x[!duplicated(let)], num := i.num, on = "let"]
# or
y[unique(x, by = "let"), num := i.num, on = "let"]
# or (though not recommended in this specific case)
y[x, num := unique(i.num), by = .EACHI, on = "let"]

edited Oct 19 '21 at 10:27

DoubleYou

1,057
11
25

answered Dec 30 '14 at 22:43

David Arenburg

91,361
17
137
196

1

Is there any performance difference between `!duplicated()` and `unique()`? I think the latter should also work somehow, but I have no idea if it would be a good option. Nice answer - (+1) – talat Dec 30 '14 at 22:46
@docendodiscimus, yes `data.table` has methods for both, (similar to `distinct` in `dplyr`) see my edit. – David Arenburg Dec 30 '14 at 22:50
Thanks for demonstrating that! Do you think they'll be the same performance-wise? – talat Dec 30 '14 at 22:53
I think that skipping `setkey(x, let)` will improve performance, also the assigning be reference. though I'm not sure how `allow.cartesian = TRUE` performs – David Arenburg Dec 30 '14 at 22:55
I had a feeling the allow.cartesian option could be invoked somehow, but am unfamiliar with cartesian merges and merge options in general--they're not explained very well in the data.table FAQ. Also, to be honest, my approach got much simpler as I was writing this question up--was originally taking another 3-4 steps on the way that I realized were extraneous. – MichaelChirico Dec 30 '14 at 22:56
See my edit. Added `.EACHI` option, though not sure about efficiency – David Arenburg Dec 30 '14 at 23:02
David, no need for `allow.cartesian`. You remove the duplicates in`i`. Your second solution is right on. No need for by=.EACHI either. What'so the advantage here? – Arun Dec 31 '14 at 06:15
@Arun It doesn't work for me without `allow.cartesian` because there are dupes in `y` too... The `by=.EACHI` one is just in order to familiarize the OP with this option (I said there is no performance gain there in the answer) – David Arenburg Dec 31 '14 at 06:32
@DavidArenburg, not sure which version you're on. But this has been fixed a while ago. `allow.cartesian` checks for duplicates in `x` (which you remove), but not in `y`. There should be no error. – Arun Dec 31 '14 at 06:42
@MichaelChirico, have you read about `allow.cartesian` from `?data.table`? In addition, [check this post](http://stackoverflow.com/a/23087759/559784). – Arun Dec 31 '14 at 06:43
@Arun, I have 1.9.4 at home... Will upgrade and check it again in a few – David Arenburg Dec 31 '14 at 06:45
@DavidArenburg, right. It's fixed in [1.9.5](https://github.com/Rdatatable/data.table). – Arun Dec 31 '14 at 06:50
@Arun, made the edit, thanks for the comment. I had to downgrade the version a while ago in order to check some stuff and forgot to upgrade back – David Arenburg Dec 31 '14 at 06:53
With the latest data.table versions the answers only work if you use `y[x[!duplicated(let)], num := i.num, on = "let"]` or `y[unique(x, by = "let"), num := i.num, on = "let"]`. – DoubleYou Oct 19 '21 at 09:06
@DoubleYou ok, thanks. I have no time to check. If you did already, please edit the answer as you like. – David Arenburg Oct 19 '21 at 09:17

score 3 · Answer 2 · edited Apr 18 '20 at 02:12

3

Well, I would use a merge like the following, but I am not sure that it is simpler than what you have already done.

merge(y, unique(x[, c('let','num')]), all.x=TRUE, by='let')

edited Apr 18 '20 at 02:12

MichaelChirico

33,841
14
113
198

answered Dec 30 '14 at 22:35

Nikos

3,267
1
25
32

I don't think it is neither simpler or more efficient – David Arenburg Dec 30 '14 at 22:37
Yes I mention that in my answer, but it is mostly to follow the `merge` approach. – Nikos Dec 30 '14 at 22:41
can't give credence to this given how consistently the merge approach is disparaged in the FAQ and elsewhere by data.table developers... – MichaelChirico Dec 30 '14 at 22:58
@MichaelChirico, the only concern is that `merge.data.table` makes a copy, which might not be always desirable. Other than that, there's no reason. Which FAQ specifically are you referring to? When secondary keys are extended, we'll be able to join even easier, which'll also remove that copy in `merge`. – Arun Dec 31 '14 at 07:02
http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf 1.12 is worded against using `merge`; I can't think of where, but I've seen another answer on SE somewhere with similar admonition – MichaelChirico Jan 02 '15 at 17:05

score 2 · Answer 3 · answered Dec 30 '14 at 22:45

2

Agree with @David, difficult to get much simpler. But below trim a few key strokes away :-)

setkey(x,let)
y<-x[!duplicated(let),.(let,num)][y]

answered Dec 30 '14 at 22:45

KFB

3,501
3
15
18

never saw this way of subsetting! is there any efficiency loss vs. using the with=F approach? – MichaelChirico Dec 30 '14 at 22:52
It is the same as `list(let, num)`. It will be probably more efficient than `with= FALSE` (though not tested) – David Arenburg Dec 30 '14 at 22:54
1

`list(let, num)` is same as `[, c("list", "num"), with=FALSE]`. But the second one is more useful when you've a function; you can pass the columns as a character vector to function argument. It's the SE equivalent of `list(let, num)`, + to provide a data.frame-like syntax. – Arun Dec 31 '14 at 07:00
KFB, nice solution, but note that this results in an entirely dataset, which we can prevent by doing `y[x, col :=value]`. – Arun Dec 31 '14 at 07:04
1

@Arun, Right, especially if dataset is big. Otherwise it may not be a practical hindrance. Admitted, I was thinking of the syntax only. After all we love data.table for its speed, memory efficiency and syntax (who said beauty is skin deep?)! Thanks!! – KFB Dec 31 '14 at 07:39

More effective merging of matched column with duplicates in data.table

3 Answers3