How to use slice in data.table in R

Question

I have a large dataset with 200.000+ rows. I want to group it by two columns, filter based on another column, while keeping all the other columns. Sample looks like this:

   hv001 hv002 hv101 hv104 hv105
1      1     1     1     1    35
2      1     1     2     2    22
3      1     1     3     2     2
4      1     1     3     2     0
5      1     4     1     1    35
6      1     4     2     2    32
7      1     4     3     2     4
8      1     4     3     2     2
9      1     4     3     1     1
10     1     7     1     1    51

structure(list(hv001 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), hv002 = c(1, 
1, 1, 1, 4, 4, 4, 4, 4, 7), hv101 = c(1, 2, 3, 3, 1, 2, 3, 3, 
3, 1), hv104 = c(1, 2, 2, 2, 1, 2, 2, 2, 1, 1), hv105 = c(35, 
22, 2, 0, 35, 32, 4, 2, 1, 51)), row.names = c(NA, -10L), class = "data.frame")

I can do this with dplyr, however it is too slow. Code looks like this:

  df |>  
    group_by(hv001, hv002) |> 
    slice_min(hv101)

Now with data.table it is possible to do this much faster, however it does not keep all the columns. How do keep all columns or how do I slice the columns when doing a group by? Code looks like this:

  df_dt <- as.data.table(df)
  df_dt[, .(min_hv101 = min(hv101)), by = c("hv001", "hv002")]

score 0 · Accepted Answer · answered Jan 08 '23 at 12:18

0

You could add .SD to subset your data table like this:

library(data.table)
df_dt <- as.data.table(df)
df_dt[, .SD[(min_hv101 = min(hv101))], by = c("hv001", "hv002")]
#>    hv001 hv002 hv101 hv104 hv105
#> 1:     1     1     1     1    35
#> 2:     1     4     1     1    35
#> 3:     1     7     1     1    51

^{Created on 2023-01-08 with reprex v2.0.2}

This answer explains the meaning of .SD very well: What does .SD stand for in data.table in R

answered Jan 08 '23 at 12:18

Quinten

35,235
5
20
53

2

This only works because the value in `hv101` corresponds to its rank. It should be `which.min`. – Andre Wildberg Jan 08 '23 at 12:33

How to use slice in data.table in R

1 Answers1