I have a huge dataset (more than 2 million rows of over 100 variables; below is a small sample). For each subj_trial
group, I want to find the first occurrence of each unique variable containing in ".wav" in message
. It should be just containing, not ending (i.e. *.wav), because some rows have a bunch of information in the message
fields (not pictured in the example, sorry).
It would be OK to output a data.frame that only had those three columns, but it's not necessary. I will later need to use the timestamp
column for analyses.
I've found this question: Extract rows for the first occurrence of a variable in a data frame, but for the life of me I cannot work that example to fit mine.
Here's some sample data:
subj_trial message timestamp
1 1_1 message 459 755616
2 1_1 . 755618
3 1_1 test1.wav 755662
4 1_1 . 765712
5 1_1 test1.wav 767918
6 1_2 . 769342
7 1_2 test2.wav 775662
8 1_2 . 786412
9 1_2 test2.wav 797460
10 1_2 . 807626
11 1_3 test3.wav 817794
12 1_3 warning 11 827960
13 2_1 message 481 817313
14 2_1 test1.wav 817347
15 2_1 . 834959
16 2_1 test1.wav 855007
17 2_1 . 880107
18 2_2 . 895723
19 2_2 test2.wav 922671
20 2_2 . 958003
21 2_2 test2.wav 994385
22 2_3 . 1016217
23 2_3 test3.wav 1036899
24 2_3 . 1047331
25 2_3 test3.wav 1142527
This is a very small example of what I'm dealing with, here. For each subj_trial
group there are probably 3000 lines, and there are over 700 groups.
Here's an example of what I'd like to have.
subj_trial message timestamp
1 1_1 test1.wav 755662
2 1_2 test2.wav 775662
3 1_3 test3.wav 817794
4 2_1 test1.wav 817347
5 2_2 test2.wav 922671
6 2_3 test3.wav 1036899
I've figured out how to get the unique values in message
over the entire dataset by doing this:
unique_message <- df[match(unique(df$message), df$message),]
But I can't figure out how to do it by group. I've also tried using group_by
in the dplyr
package but can't get that to work, either. Have mercy and show me the way, friends. Thanks!