data.table - keep first row per group OR based on condition

Question

I would like to keep 1st obs in group OR mpg >= 10. Is there any way we can do without creating a grouping of variables from .N?

I am looking for a solution using data.table package. I tried below but it is expecting j so get warning.

library(data.table)

x <- mtcars

setDT(x)

x[.N==1 | mpg >= 10,,by=carb]

Related: [Select the first row by group](https://stackoverflow.com/questions/13279582/select-the-first-row-by-group); — Henrik, Oct 07 '21 at 21:39

score 2 · Accepted Answer · answered Oct 07 '21 at 19:08

Try this.

Using mpg >= 50, we should get one row per carb:

x[ rowid(carb) == 1 | mpg >= 50,]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 160.0   110  3.90  2.62 16.46     0     1     4     4
# 2:  22.8     4 108.0    93  3.85  2.32 18.61     1     1     4     1
# 3:  18.7     8 360.0   175  3.15  3.44 17.02     0     0     3     2
# 4:  16.4     8 275.8   180  3.07  4.07 17.40     0     0     3     3
# 5:  19.7     6 145.0   175  3.62  2.77 15.50     0     1     5     6
# 6:  15.0     8 301.0   335  3.54  3.57 14.60     0     1     5     8

Using mpg >= 30 (since all(mpg > 10)), we should get all of the above plus a few more:

x[ rowid(carb) == 1 | mpg >= 30,]
#       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#     <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#  1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
#  2:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
#  3:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
#  4:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3
#  5:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#  6:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#  7:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#  8:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#  9:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6
# 10:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8

An alternative, in case you need more grouping variables:

x[, .SD[seq_len(.N) == 1L | mpg >= 30,], by = carb]

though I've been informed that rowid(...) is more efficient than seq_len(.N).

Another way to avoid `by` would be `!duplicated`. (in case of more grouping variables, `duplicated.data.table` has a `by` argument`) — Henrik, Oct 07 '21 at 21:35
Okay, but I don't know how finding duplicated rows is going to help this process. Can you expand on how you'd use that in the filtering? @Henrik — r2evans, Oct 07 '21 at 22:03
Note the `!`. The _first_ row in a set of identitical values is the "not duplicated" row, all subsequent rows are duplicated. — Henrik, Oct 08 '21 at 01:41
Okay, I think I see what you're saying there. I guess that would make it `x[!duplicated(x, by="carb") | mpg >= 30,]`. It's an interesting alternative, thank you for clarifying. — r2evans, Oct 08 '21 at 03:41
If duplication is determined by a single variable, it's enough to provide the variable name (I suppose `duplicated.default` is called then); e.g. `x[!duplicated(carb)]`. See also the link I provided above. — Henrik, Oct 08 '21 at 06:28
Yes, but the OP has other conditions where de-duplication is but one step. `x[!duplicated(carb)]` does not correctly answer the question here. — r2evans, Oct 08 '21 at 14:10
I completely agree. The duplicated stuff was just an alternative to rowid part. Cheers — Henrik, Oct 08 '21 at 16:58

akrun · Answer 2 · 2021-10-07T19:09:09.763

1

We can use .I to get the rowindex for subsetting

 x[x[, .I[seq_len(.N) == 1|mpg >= 30], by = carb]$V1]
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 2: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 3: 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
 4: 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
 5: 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
 6: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
 7: 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
 8: 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
 9: 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
10: 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

edited Oct 07 '21 at 19:09

answered Oct 07 '21 at 18:49

akrun

874,273
37
540
662

Just to test it. I used mpg >= 50 and I was expecting 1 row per group as it is OR condition but it is only showing 2 rows. I am looking for either 1st row in group OR any other row where mpg >= 10 (even though it is not 1st row in group) – R007 Oct 07 '21 at 18:58
@R007 Are you saying that you want to get the first row where mpg >= 10 and if it doesn't exist, then the first row – akrun Oct 07 '21 at 19:08
@R007 Please check if this is the expected output or not – akrun Oct 07 '21 at 19:10

data.table - keep first row per group OR based on condition

2 Answers2