Extract rows for the first occurrence of a variable in a group

Question

I have a huge dataset (more than 2 million rows of over 100 variables; below is a small sample). For each subj_trial group, I want to find the first occurrence of each unique value containing ".wav" in message. The match should be "contains", not "ends with" (i.e. not just *.wav), because some rows have a bunch of other information in the message field (not pictured in the example, sorry).

It would be OK to output a data.frame that only had those three columns, but it's not necessary. I will later need to use the timestamp column for analyses.

I've found this question: Extract rows for the first occurrence of a variable in a data frame, but for the life of me I cannot work that example to fit mine.

Here's some sample data:

   subj_trial     message timestamp
1         1_1 message 459    755616
2         1_1           .    755618
3         1_1   test1.wav    755662
4         1_1           .    765712
5         1_1   test1.wav    767918
6         1_2           .    769342
7         1_2   test2.wav    775662
8         1_2           .    786412
9         1_2   test2.wav    797460
10        1_2           .    807626
11        1_3   test3.wav    817794
12        1_3  warning 11    827960
13        2_1 message 481    817313
14        2_1   test1.wav    817347
15        2_1           .    834959
16        2_1   test1.wav    855007
17        2_1           .    880107
18        2_2           .    895723
19        2_2   test2.wav    922671
20        2_2           .    958003
21        2_2   test2.wav    994385
22        2_3           .   1016217
23        2_3   test3.wav   1036899
24        2_3           .   1047331
25        2_3   test3.wav   1142527

This is a very small example of what I'm dealing with, here. For each subj_trial group there are probably 3000 lines, and there are over 700 groups.

Here's an example of what I'd like to have.

  subj_trial   message timestamp
1        1_1 test1.wav    755662
2        1_2 test2.wav    775662
3        1_3 test3.wav    817794
4        2_1 test1.wav    817347
5        2_2 test2.wav    922671
6        2_3 test3.wav   1036899

I've figured out how to get the unique values in message over the entire dataset by doing this:

unique_message <- df[match(unique(df$message), df$message),]

But I can't figure out how to do it by group. I've also tried using group_by in the dplyr package but can't get that to work, either. Have mercy and show me the way, friends. Thanks!

@SerbanTanasa Doing this would not be helpful because it would be 25 lines of just one group, with mostly "." in the `message` field and only one instance of a .wav value. The example I provided is a good one. — Elizabeth Crutchley, Nov 01 '16 at 21:38
@SerbanTanasa OK, I see that you are concerned about the format and not the content. Thanks for letting me know. — Elizabeth Crutchley, Nov 01 '16 at 21:44

Cotton.Rockwood · Answer 1 · 2016-11-01T23:57:10.453

Here is a dplyr solution as well, if you are interested:

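library(dplyr)

dat %>%
  filter(grepl("\\.wav", message)) %>%  # keep rows whose message contains ".wav"
  group_by(subj_trial) %>%
  top_n(n = 1, wt = desc(timestamp))    # row with the smallest timestamp per group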

First, filter the data to just those rows containing .wav in the message column. Then group the data by subject trial and return the row with the smallest timestamp in each group. This assumes you want the smallest timestamp, not necessarily the first one in the data set (i.e. if a record with a larger timestamp came first, it would NOT be returned). It wasn't clear to me which you were looking for, but perhaps in your case there is no difference.
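To make that distinction concrete, here is a minimal sketch on made-up data (the toy table and its values are hypothetical, not taken from the question). It assumes dplyr 1.0.0 or later, where slice_min() and slice_head() are available; slice_min() is swapped in here for top_n() with desc().

```r
library(dplyr)

# Hypothetical toy data: in group "a" the larger timestamp appears first.
toy <- data.frame(
  subj_trial = c("a", "a", "b", "b"),
  message    = c("x.wav", "x.wav", ".", "y.wav"),
  timestamp  = c(20, 10, 30, 40)
)

wav_rows <- toy %>%
  filter(grepl("\\.wav", message)) %>%
  group_by(subj_trial)

# Smallest timestamp per group (what the top_n() call above targets).
by_time <- wav_rows %>%
  slice_min(timestamp, n = 1, with_ties = FALSE) %>%
  ungroup()

# First matching row per group in data order, regardless of timestamp.
by_position <- wav_rows %>%
  slice_head(n = 1) %>%
  ungroup()

by_time$timestamp      # 10 40
by_position$timestamp  # 20 40
```

If the data are already sorted by timestamp within each group, as in the question's sample, both variants should return the same rows.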

And since I'm always curious about the efficiency differences between data.table and dplyr approaches, I ran a microbenchmark test. It looks like in this case, data.table has a slight speed advantage:

library(microbenchmark)
library(data.table)
library(dplyr)

set.seed(1)
dat <- data.frame(subj_trial=paste0(sample(1:20,1e6,replace=TRUE),"_",sample(1:20,1e6,replace=TRUE)),
                  message=sample(c(".wav","others"), 1e6, replace=TRUE),
                  timestamp=round(seq(from=1000, to=9142527, length.out = 1e6)))

# Independent data.table copy, so that dat itself stays a plain data.frame
dat2 <- as.data.table(dat)

microbenchmark({dat %>%
  filter(grepl("\\.wav", message)) %>%
  group_by(subj_trial) %>%
  top_n(1, wt=desc(timestamp))},
  {dat2[grepl("\\.wav", message), .SD[1], by=subj_trial]})

Results:

Unit: milliseconds

(1) dat %>% filter(grepl("\\.wav", message)) %>% group_by(subj_trial) %>% top_n(1, wt = desc(timestamp))
(2) dat2[grepl("\\.wav", message), .SD[1], by = subj_trial]

expr      min       lq     mean   median       uq      max neval cld
 (1) 332.9693 357.7426 387.2245 367.6443 380.9935 637.9223   100   b
 (2) 263.0292 272.8627 293.4976 281.4568 285.7699 582.9954   100  a

Hey, thanks for your help! Three years down the line, and it still works brilliantly! — Will M, Sep 26 '19 at 22:10

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

Also using data.table, but with a more concise formulation:

setDT(DT)
DT[,.SD[grep("\\.wav",message)[1]],by=subj_trial]

Edit: As suggested by a comment below,

DT[grepl("\\.wav", message), .SD[1], by=subj_trial]

might be even faster, since it uses boolean logic and data.table's optimized i subsetting.

.SD is a data.table containing the Subset of DT's Data for each group, excluding any columns used in by (or keyby).

by is a bit like the group by operator in SQL. It designates the grouping column(s).

grep(pattern, x) returns the indices of all matches for the pattern in x, where x is a vector. The \\ before .wav prevents grep from treating . as a special character (in a regular expression, an unescaped . matches any single character).
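As a small illustration of the escaping point, here is a base R sketch on a made-up vector (the msgs values are hypothetical). It compares the escaped pattern, the unescaped pattern, and a literal match with fixed = TRUE.

```r
# Hypothetical messages; "Xwav" contains "wav" but no literal ".wav".
msgs <- c("message 459", ".", "test1.wav", "Xwav", "test1.wav extra info")

escaped_idx   <- grep("\\.wav", msgs)             # literal dot: 3 5
unescaped_idx <- grep(".wav", msgs)               # any char + "wav": 3 4 5
fixed_idx     <- grep(".wav", msgs, fixed = TRUE) # no regex at all: 3 5

# grepl() returns one TRUE/FALSE per element instead of indices.
contains <- grepl("\\.wav", msgs)  # FALSE FALSE TRUE FALSE TRUE
```

The fifth element shows that this is a "contains" match: extra text after .wav does not prevent it.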

vector_name[1] returns the first element of a vector called vector_name. It can also be applied directly to the result of a function call, such as grep above.

The general data.table form is DT[i, j, by]: i is the subset or join, j is the operation to be performed, and by is the grouping element. In the first version above, i is left empty (hence the leading ,) since we want to work on the full set. j is the operation applied to .SD within each group, and by is the column you want your results grouped by.
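Putting the pieces together, the following sketch runs both formulations on a few rows copied from the question's sample data (only groups 1_1 and 1_2 are included, to keep it short). The names res_j and res_i are just illustrative.

```r
library(data.table)

# First nine rows of the question's sample data.
DT <- data.table(
  subj_trial = c("1_1", "1_1", "1_1", "1_1", "1_1", "1_2", "1_2", "1_2", "1_2"),
  message    = c("message 459", ".", "test1.wav", ".", "test1.wav",
                 ".", "test2.wav", ".", "test2.wav"),
  timestamp  = c(755616, 755618, 755662, 765712, 767918,
                 769342, 775662, 786412, 797460)
)

# Empty i: j looks up the first matching row inside each group.
res_j <- DT[, .SD[grep("\\.wav", message)[1]], by = subj_trial]

# Filter in i first: j then just takes the first remaining row per group.
res_i <- DT[grepl("\\.wav", message), .SD[1], by = subj_trial]

res_i
```

On these rows both calls should return test1.wav at 755662 for 1_1 and test2.wav at 775662 for 1_2.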

Beat me to it - I was going to suggest `dat[grepl("\\.wav", message), .SD[1], by=subj_trial]` — thelatemail, Nov 01 '16 at 22:12
Just ran a quick benchmark of this on `set.seed(1); dat <- data.table(subj_trial=sample(1:1e5,1e6,replace=TRUE), message=sample(c(".wav","others"), 1e6, replace=TRUE))` . Moving the `grepl` into the `i` of data.table as per my comment above makes it a **lot** faster (30 secs vs. 0.3 secs) — thelatemail, Nov 01 '16 at 22:20

score 1 · Accepted Answer · edited Nov 01 '16 at 22:20

Using data.table:

library(data.table)
setDT(DT)
DT[,{
  id=head(grep("\\.wav",message),1)
  list(message=message[id],timestamp=timestamp[id])
},subj_trial]

#    subj_trial   message timestamp
# 1:        1_1 test1.wav    755662
# 2:        1_2 test2.wav    775662
# 3:        1_3 test3.wav    817794
# 4:        2_1 test1.wav    817347
# 5:        2_2 test2.wav    922671
# 6:        2_3 test3.wav   1036899
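One detail that may help when reading the block above is the use of head(..., 1) rather than [1]. The base R sketch below, on a made-up group with no .wav message, shows how the two differ when grep finds nothing; the variable names are hypothetical.

```r
msgs <- c(".", "warning 11")   # hypothetical group with no .wav message
hits <- grep("\\.wav", msgs)   # integer(0): no matches

id_head  <- head(hits, 1)      # still integer(0)
id_first <- hits[1]            # NA_integer_

msgs[id_head]                  # character(0)
msgs[id_first]                 # NA
```

With head(), a group without any match yields zero-length columns, so it would presumably contribute no row to the result rather than a row of NA values.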

agstudy · answered Nov 01 '16 at 21:37 · edited Nov 01 '16 at 22:20 by Serban Tanasa

do you care to explain a bit how this works? Looks highly unreadable. – Serban Tanasa Nov 01 '16 at 21:43
