3

I have a data.table and need to know the index of the row containing a minimal value under a given condition. Simple example:

dt <- data.table(i=11:13, val=21:23)

#     i  val
# 1: 11   21
# 2: 12   22
# 3: 13   23

Now, suppose I'd like to know in which row val is minimal under the condition i>=12, which is 2 in this case.

What didn't work:

dt[i>=12, which.min(val)]
# [1] 1

returns 1, because within dt[i>=12] it is the first row.

Also

dt[i>=12, .I[which.min(val)]]
# [1] 1

returned 1, because .I is only supposed to be used with grouping.

What did work:

To apply .I correctly, I added a grouping column:

dt[i>=12, g:=TRUE]
dt[i>=12, .I[which.min(val)], by=g][, V1]
# [1] 2

Note, that g is NA for i<12, thus which.min excludes that group from the result.

But, this requires extra computational power to add the column and perform the grouping. My productive data.table has several millions of rows and I have to find the minimum very often, so I'd like to avoid any extra computations.

Do you have any idea, how to efficiently solve this?

bjhend
  • 1,538
  • 11
  • 25
  • 3
    Why not just `dt[, idx := .I][i >= 12, idx[which.min(val)]]` where you create an index variable first? The overhead will be tiny surely. – thelatemail Jul 27 '17 at 23:04
  • @thelatemail: Interesting. I tried something with an index column, but it was more cumbersome so I left it out. However, it also requires adding another column but at least avoids grouping. – bjhend Jul 27 '17 at 23:40
  • 1
    You might want to also consider a dense rank if you are removing minimum values, then the next minimum, the the next... etc `frank(c(2,5,1,3,1),ties.method="dense")` for instance. – thelatemail Jul 27 '17 at 23:45
  • @thelatemail: Nice tip, didn't know about ``frank`` (or ``rank``), yet. However, I have to compute new values on each round, so the ranks will be different each time. – bjhend Jul 28 '17 at 00:12

2 Answers2

6

But, this requires extra computational power to add the column and perform the grouping.

So, keep the data sorted by it if it's so important:

setorder(dt, val)
dt[.(i_min = 12), on=.(i >= i_min), mult="first", which = TRUE]
# 2

This can also be extended to check more threshold i values. Just give a vector in i_min =:

dt[.(i_min = 9:14), on=.(i >= i_min), mult="first", which = TRUE]
# [1]  1  1  1  2  3 NA

How it works

x[i, on=, ...] is the syntax for a join.

  • i can be another table or equivalently a list of equal-length vectors.
  • .() is a shorthand for list().
  • on= can have inequalities for a "non-equi join".
  • mult= can determine what happens when a row of i has more than one match in x.
  • which=TRUE will return row numbers of x instead of the full joined table.
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Wow, this is very impressive. It took me quite a while to understand it in detail. It's unbelievable, what a powerful tool ``data.table`` joins are. Thanks for this insight. However, my ``data.table`` is huge and I've to search for the minimum very often after removing the previous minimum from the ``data.table``. So, I'm afraid sorting will cost too much as even the ``log n`` in ``O(n * log n)`` becomes big. – bjhend Jul 27 '17 at 23:11
  • 1
    @bjhend If you mean that you remove rows, that can be very costly and is probably best avoided (since it can't be done without copying the whole table yet https://stackoverflow.com/q/10790204/ ). You could maybe "remove" by setting `i` to missing/NA. Even if you for some reason need to make a new table by dropping rows, if it was sorted before removal, it will still be sorted afterwards (since that's how sorting works), right? – Frank Jul 27 '17 at 23:20
  • 1
    I already suspected that removing is the biggest performance killer in my code (I know ``data.table``s are stored column-wise). So, I will definitely try your suggestion. Unfortunately, I cannot avoid reordering, because on every round I have to compute new values of which I've to find the minimum. – bjhend Jul 27 '17 at 23:25
  • Ok. Well, as you become more familiar with the syntax, you'll probably come up with other ways to improve efficiency that suit your use case. – Frank Jul 27 '17 at 23:27
1

You can use the fact that which.min will ignore NA values to "mask" the values you don't want to consider:

dt[,which.min(ifelse(i>=12, val, NA))]

As a simple example of this behavior, which.min(c(NA, 2, 1)) returns 3, because the 3rd element is the min among all the non-NA values.

Ryan C. Thompson
  • 40,856
  • 28
  • 97
  • 159