R subset rows for all occurrences of first variable in column

Question

I would like to create a subset of a large data frame based on two columns, ColA and ColB (below). For each variable in ColA (e.g. A, B, C...), I would like to extract the rows corresponding to each occurrence of the first variable in ColB. Therefore DF1:

ColA  ColB  ColC
A     Red     7thing
A     Red     OneBot
A     Blue    BotOne
B     Green   Thing7
B     Green   Twosies
B     Green   Square
B     Yellow  Circle
B     Yellow  Polygon
B     Purple  Triangle
B     White   Octagon
C     Orange  Cube
C     Black   Line

Would become DF2:

ColA  ColB  ColC
A     Red     7thing
A     Red     OneBot
B     Green   Thing7
B     Green   Twosies
B     Green   Square
C     Orange  Cube

I am not concerned about duplicates in ColC, and there is no pattern to the number of occurrences of a variable in ColB. The related subsetting strategies I have found focus on extracting unique cases, but I have neither come across nor been able to devise an approach to permit selecting all occurrences of the first variable type and would be grateful for some assistance.

I have tried

DF2 <- DF1[match(unique(DF1$ColB), DF1$ColB),]

...and subsetting that targets the variable when the variable content is known (e.g. Subset first n occurrences of certain value in dataframe in R), these are not appropriate seemingly.

what code have you tried? SO is a coding resource not a coding service — Nate, Aug 25 '16 at 01:13
I have edited my post to include some code, and could put more if helpful, but the problem is not a bug in otherwise appropriate code, which is why I did not include it. Apologies if my question came across as expecting a service, that is not the case, the SO guidelines seemed to suggest including code is not appropriate for every question, which is what I used as my guide. — SPZ, Aug 25 '16 at 01:29

thelatemail · Accepted Answer · 2016-08-25T01:35:46.007

5

The power of ave in base R, to compare all ColB cases to the first ColB in each group:

dat[with(dat, ColB == ave(ColB, ColA, FUN=function(x) head(x,1) )),]

#   ColA   ColB    ColC
#1     A    Red  7thing
#2     A    Red  OneBot
#4     B  Green  Thing7
#5     B  Green Twosies
#6     B  Green  Square
#11    C Orange    Cube

Using your original logic, you could also merge back on the non-duplicated records of ColA/ColB only:

merge(dat, dat[c("ColA","ColB")][!duplicated(dat$ColA),])

edited Aug 25 '16 at 01:35

answered Aug 25 '16 at 01:29

thelatemail

91,185
12
128
188

Wonderful, thank you for pointing out the merge tactic as well. – SPZ Aug 25 '16 at 01:43

score 3 · Answer 2 · answered Aug 25 '16 at 01:38

Another approach with aggregate

df[df$ColB %in% unlist(aggregate(ColB~ColA, df, function(x) head(x, 1))[2]), ]

#    ColA   ColB    ColC
#1     A    Red  7thing
#2     A    Red  OneBot
#4     B  Green  Thing7
#5     B  Green Twosies
#6     B  Green  Square
#11    C Orange    Cube

score 3 · Answer 3 · answered Aug 25 '16 at 02:42

As it is a big dataset, approaches using data.table are

library(data.table)
setDT(df1)[df1[, .I[ColB==ColB[1L]], ColA]$V1]
#   ColA   ColB    ColC
#1:    A    Red  7thing
#2:    A    Red  OneBot
#3:    B  Green  Thing7
#4:    B  Green Twosies
#5:    B  Green  Square
#6:    C Orange    Cube

Or with .SD

setDT(df1)[, .SD[ColB==ColB[1L]], ColA]

score 2 · Answer 4 · answered Aug 25 '16 at 01:28

2

You can use dplyr to accomplish what you want:

library(dplyr)
df2 <- df1 %>% group_by(ColA) %>% filter(ColB == first(ColB))

First group_by ColA, then filter to keep only rows where ColB equals the first value in ColB. The results using your data is:

print(df2)
##Source: local data frame [6 x 3]
##Groups: ColA [3]
##
##    ColA   ColB    ColC
##  <fctr> <fctr>  <fctr>
##1      A    Red  7thing
##2      A    Red  OneBot
##3      B  Green  Thing7
##4      B  Green Twosies
##5      B  Green  Square
##6      C Orange    Cube

answered Aug 25 '16 at 01:28

aichao

7,375
3
16
18

No idea why you're downvoted?! This works perfectly fine. – thelatemail Aug 25 '16 at 01:40
@thelatemail: thanks for the support! As long as the OP gets what he/she is looking for, I'm good. – aichao Aug 25 '16 at 01:45
Thanks aichao and thelatemail, this group_by approach is straightforward and useful! – SPZ Aug 25 '16 at 01:47

R subset rows for all occurrences of first variable in column

4 Answers4