
Is it possible to combine chaining and assignment by reference in a data.table?

For example, I would like to do this:

DT[a == 1][b == 0, c := 2]

However, this leaves the original table unchanged: DT[a == 1] creates a temporary table, and the := assignment then modifies that temporary copy, which is returned and discarded.
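A minimal sketch showing the copy being made (the mock data here is made up for illustration):

library(data.table)
DT <- data.table(a = c(1, 1, 2), b = c(0, 1, 0))
DT[a == 1][b == 0, c := 2] ## `:=` modifies the temporary subset returned by DT[a == 1]
DT                         ## unchanged: still no column `c`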

I would rather not do

DT[a == 1 & b == 0, c := 2]

as this is very slow and I would also rather avoid

DT <- DT[a == 1][b == 0, c := 2]

as I would prefer to do the assignment by reference. This question came up as part of [1], where it was left unanswered.

[1] Conditional binary join and update by reference using the data.table package

Hans-Peter Schrei
  • I don't see how this question is related to the other one, and you need to show context where `DT[a == 1 & b == 0]` is "very slow". If that's the part that's slow for you, more than likely you're doing something else wrong. – eddi Apr 23 '15 at 02:10

1 Answer


I'm not sure why you think that, even if DT[a == 1][b == 0, c := 2] worked the way you intend, it would be more efficient than DT[a == 1 & b == 0, c := 2].

Either way, the most efficient solution in your case would be to key by both a and b and conduct the assignment by reference while performing a binary join on both:

library(data.table)

DT <- data.table(a = c(1, 1, 1, 2, 2), b = c(0, 2, 0, 1, 1)) ## mock data
setkey(DT, a, b)    ## key (and sort) by both `a` and `b`
DT[J(1, 0), c := 2] ## binary join on the key; update `c` by reference
DT
#    a b  c
# 1: 1 0  2
# 2: 1 0  2
# 3: 1 2 NA
# 4: 2 1 NA
# 5: 2 1 NA
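Note that newer versions of data.table (1.9.6 and later) also support ad hoc joins through the on argument, so the same update by reference works without setting a key. A sketch of the equivalent call:

DT[.(1, 0), on = c("a", "b"), c := 2] ## join on `a` and `b` without setkey(); update by reference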
David Arenburg
  • Thanks. Can this also be extended to arbitrary conditions, say `a %in% set` and `b < constant`? – Hans-Peter Schrei Apr 22 '15 at 22:52
  • No. What's wrong with just doing `DT[a %in% 1:2 & b == 0, c := 2]`, for example? It should be very efficient, as `data.table` sets secondary keys (if you didn't). But in this case you already keyed the data. – David Arenburg Apr 22 '15 at 23:00
  • I strongly dislike answers that recommend keying to "speed up" a simple look-up. The vast majority of the time that's exactly the wrong thing to do and will slow things down, because users take an already bad thing they did (running a non-vectorized loop) and make it worse by adding an extra sort. – eddi Apr 23 '15 at 02:06
  • @eddi what do you mean by "vast majority of time"? If the data is small, the difference will ne in milliseconds. If the data is very big, binary join will be ten folds faster. This looks like the situation here, otherwise why would his look up be slow? Either way, `data.table` sets secondary keys when you perform a lookup on an unkeyed data set, that means that the data will be keyed if you want it or not. And the reason it is being keyed is because the package creators beleive that a lookup on a keyed data is much faster. – David Arenburg Apr 23 '15 at 06:53
  • I should note that I have posed my question quite badly. I am actually more interested in the case of conditions such as `a %in% set` and `b < constant` (see the sketch after this thread). Thanks for the discussion and pointers in the right direction. – Hans-Peter Schrei Apr 23 '15 at 07:26
  • @DavidArenburg yes, a lookup on keyed data is much faster, **but** keying + lookup is usually slower than a lookup on non-keyed data. Secondary keys come at that same cost, and I don't think they're useful outside of some very special situations. – eddi Apr 23 '15 at 07:40
  • @eddi what do you mean by "keying + lookup on **keyed** data?" the data wasn't keyed in the first place so this was just keying + lookup. But I suggested in my comment that now that the data is keyed, operation such `%in%` should very efficient too, didn't I? – David Arenburg Apr 23 '15 at 07:50
  • @DavidArenburg I mean `system.time({setkey(dt, a); dt[.(5)]})` is strictly larger than `system.time(dt[a == 5])` (assuming disabled secondary keys). One might argue that *if* many lookups were done after keying then the cost might be worth it, but in my experience that scenario is extremely unlikely, and oftentimes when it comes up on SO it's an issue of misusing loops. – eddi Apr 23 '15 at 08:07
  • Computing the order isn't *that expensive*. I've done extensive tests on data sets with up to a billion rows (and several columns). It can be slower than a normal vector scan on really large data (>10 or 100 million rows) on the first run, but not by much (a few seconds). It's the *re-ordering* that is expensive; hence `setkey()` wouldn't give much for normal operations. Joins on large data benefit from cache efficiency due to sorted data, i.e., on data so large that cache inefficiency trumps reordering time. [A nice way to compare this is with dplyr on really large data.] – Arun Apr 23 '15 at 12:31
  • Auto indexing is just the beginning of secondary keys. They'll be extended to joins, rolling values on ordinary subsets, etc. You'll always still be able to `setkey()` if you believe your data is so large that cache inefficiency trumps reordering, but the advantage of secondary keys is that we know the exact indices of each value you'd like to subset/join on, in addition to not having to reorder the data. It's a great compromise between speed and functionality in most cases. – Arun Apr 23 '15 at 12:31
  • @Arun can you please give a couple of *common* scenarios where secondary key computation and caching is useful? And a question: does the secondary key cache grow any time you compare something with a new column, or does it get replaced with the latest column? – eddi Apr 24 '15 at 12:06
  • @eddi, a) On secondary keys, when extended, we'd be able to join as follows: `X[Y, on=c(a = "x", b = "y")]` - without setting keys. And possibly, `X[a > 5, col := val, roll = TRUE]` could roll values that evaluate to FALSE (if we use the auto index), etc. This is extremely advantageous syntactically (it removes the need to know about keying at all), and would be incredibly efficient in most scenarios. – Arun Apr 24 '15 at 12:55
  • b) On very large data sets, a huge overhead is moving data between memory and cache (read: main memory is slow). Having the data sorted results in fewer cache misses and gains a lot from prefetching, as the data is contiguous in memory. This can be so efficient that it trumps the overhead of reordering. The most efficient scenarios would be joins on very large data (and/or a large number of groups). – Arun Apr 24 '15 at 12:57
  • For auto-indexing, we add an attribute naming the column used and store the column order. When another column is used, it is also stored. But that's just for auto-indexing. For joins, we could/should compute on the fly (or use existing attributes if they are appropriate). Does this help a bit? – Arun Apr 24 '15 at 12:59
  • @Arun thanks! I'll buy the argument that for large joins you may want to pre-sort, and I'd very much appreciate an `on` param (though I don't think this requires secondary keys the way I see them now). The part I'm not as convinced about is that it's a good idea to run a sort and store the resulting indices for every column I put in the i-expression. – eddi Apr 24 '15 at 13:27
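Following up on the case raised in the comments (conditions like a %in% set and b < constant), here are two sketches with made-up values; the `b_max` column name is purely illustrative. The first is a plain vectorized scan; the second keeps the update as a join by reference using a non-equi join, which requires data.table 1.9.8 or later.

library(data.table)
DT <- data.table(a = c(1, 1, 1, 2, 2), b = c(0, 2, 0, 1, 1))

## Option 1: plain vectorized scan
DT[a %in% c(1, 2) & b < 1, c := 2]

## Option 2: non-equi update join (data.table >= 1.9.8);
## matches rows where DT's a equals i's a and DT's b < i's b_max
DT[data.table(a = c(1, 2), b_max = 1), on = .(a, b < b_max), c := 2]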