Apply which.min to data.table under a condition

Question

I have a data.table and need to know the index of the row containing a minimal value under a given condition. Simple example:

dt <- data.table(i=11:13, val=21:23)

#     i  val
# 1: 11   21
# 2: 12   22
# 3: 13   23

Now, suppose I'd like to know in which row val is minimal under the condition i>=12, which is 2 in this case.

What didn't work:

dt[i>=12, which.min(val)]
# [1] 1

returns 1, because within dt[i>=12] it is the first row.

Also

dt[i>=12, .I[which.min(val)]]
# [1] 1

returned 1, because .I is only supposed to be used with grouping.

What did work:

To apply .I correctly, I added a grouping column:

dt[i>=12, g:=TRUE]
dt[i>=12, .I[which.min(val)], by=g][, V1]
# [1] 2

Note, that g is NA for i<12, thus which.min excludes that group from the result.

But, this requires extra computational power to add the column and perform the grouping. My productive data.table has several millions of rows and I have to find the minimum very often, so I'd like to avoid any extra computations.

Do you have any idea, how to efficiently solve this?

Why not just `dt[, idx := .I][i >= 12, idx[which.min(val)]]` where you create an index variable first? The overhead will be tiny surely. — thelatemail, Jul 27 '17 at 23:04
@thelatemail: Interesting. I tried something with an index column, but it was more cumbersome so I left it out. However, it also requires adding another column but at least avoids grouping. — bjhend, Jul 27 '17 at 23:40
You might want to also consider a dense rank if you are removing minimum values, then the next minimum, the the next... etc `frank(c(2,5,1,3,1),ties.method="dense")` for instance. — thelatemail, Jul 27 '17 at 23:45
@thelatemail: Nice tip, didn't know about ``frank`` (or ``rank``), yet. However, I have to compute new values on each round, so the ranks will be different each time. — bjhend, Jul 28 '17 at 00:12

Frank · Answer 1 · 2017-07-27T22:52:09.443

6

But, this requires extra computational power to add the column and perform the grouping.

So, keep the data sorted by it if it's so important:

setorder(dt, val)
dt[.(i_min = 12), on=.(i >= i_min), mult="first", which = TRUE]
# 2

This can also be extended to check more threshold i values. Just give a vector in i_min =:

dt[.(i_min = 9:14), on=.(i >= i_min), mult="first", which = TRUE]
# [1]  1  1  1  2  3 NA

How it works

x[i, on=, ...] is the syntax for a join.

i can be another table or equivalently a list of equal-length vectors.
.() is a shorthand for list().
on= can have inequalities for a "non-equi join".
mult= can determine what happens when a row of i has more than one match in x.
which=TRUE will return row numbers of x instead of the full joined table.

edited Jul 27 '17 at 22:52

answered Jul 27 '17 at 22:45

Frank

66,179
8
96
180

1

Wow, this is very impressive. It took me quite a while to understand it in detail. It's unbelievable, what a powerful tool ``data.table`` joins are. Thanks for this insight. However, my ``data.table`` is huge and I've to search for the minimum very often after removing the previous minimum from the ``data.table``. So, I'm afraid sorting will cost too much as even the ``log n`` in ``O(n * log n)`` becomes big. – bjhend Jul 27 '17 at 23:11
1

@bjhend If you mean that you remove rows, that can be very costly and is probably best avoided (since it can't be done without copying the whole table yet https://stackoverflow.com/q/10790204/ ). You could maybe "remove" by setting `i` to missing/NA. Even if you for some reason need to make a new table by dropping rows, if it was sorted before removal, it will still be sorted afterwards (since that's how sorting works), right? – Frank Jul 27 '17 at 23:20
1

I already suspected that removing is the biggest performance killer in my code (I know ``data.table``s are stored column-wise). So, I will definitely try your suggestion. Unfortunately, I cannot avoid reordering, because on every round I have to compute new values of which I've to find the minimum. – bjhend Jul 27 '17 at 23:25
Ok. Well, as you become more familiar with the syntax, you'll probably come up with other ways to improve efficiency that suit your use case. – Frank Jul 27 '17 at 23:27

score 1 · Accepted Answer · answered Jul 27 '17 at 22:29

1

You can use the fact that which.min will ignore NA values to "mask" the values you don't want to consider:

dt[,which.min(ifelse(i>=12, val, NA))]

As a simple example of this behavior, which.min(c(NA, 2, 1)) returns 3, because the 3rd element is the min among all the non-NA values.

answered Jul 27 '17 at 22:29

Ryan C. Thompson

40,856
28
97
159

Thanks, this looks quite efficient to me. I assume it has to go through the rows only two times resulting in a ``O(2*n)`` complexity. – bjhend Jul 27 '17 at 23:15
I think that's right. Once to build the masked value vector and again to find the min. – Ryan C. Thompson Jul 28 '17 at 04:22

Apply which.min to data.table under a condition

What didn't work:

What did work:

2 Answers2