0

I would like to create a subset of a large data frame based on two columns, ColA and ColB (below). For each variable in ColA (e.g. A, B, C...), I would like to extract the rows corresponding to each occurrence of the first variable in ColB. Therefore DF1:

ColA  ColB  ColC
A     Red     7thing
A     Red     OneBot
A     Blue    BotOne
B     Green   Thing7
B     Green   Twosies
B     Green   Square
B     Yellow  Circle
B     Yellow  Polygon
B     Purple  Triangle
B     White   Octagon
C     Orange  Cube
C     Black   Line

Would become DF2:

ColA  ColB  ColC
A     Red     7thing
A     Red     OneBot
B     Green   Thing7
B     Green   Twosies
B     Green   Square
C     Orange  Cube

I am not concerned about duplicates in ColC, and there is no pattern to the number of occurrences of a variable in ColB. The related subsetting strategies I have found focus on extracting unique cases, but I have neither come across nor been able to devise an approach to permit selecting all occurrences of the first variable type and would be grateful for some assistance.

I have tried

DF2 <- DF1[match(unique(DF1$ColB), DF1$ColB),]

...and subsetting that targets the variable when the variable content is known (e.g. Subset first n occurrences of certain value in dataframe in R), these are not appropriate seemingly.

Community
  • 1
  • 1
SPZ
  • 108
  • 6
  • 1
    what code have you tried? SO is a coding resource not a coding service – Nate Aug 25 '16 at 01:13
  • I have edited my post to include some code, and could put more if helpful, but the problem is not a bug in otherwise appropriate code, which is why I did not include it. Apologies if my question came across as expecting a service, that is not the case, the SO guidelines seemed to suggest including code is not appropriate for every question, which is what I used as my guide. – SPZ Aug 25 '16 at 01:29

4 Answers4

5

The power of ave in base R, to compare all ColB cases to the first ColB in each group:

dat[with(dat, ColB == ave(ColB, ColA, FUN=function(x) head(x,1) )),]

#   ColA   ColB    ColC
#1     A    Red  7thing
#2     A    Red  OneBot
#4     B  Green  Thing7
#5     B  Green Twosies
#6     B  Green  Square
#11    C Orange    Cube

Using your original logic, you could also merge back on the non-duplicated records of ColA/ColB only:

merge(dat, dat[c("ColA","ColB")][!duplicated(dat$ColA),])
thelatemail
  • 91,185
  • 12
  • 128
  • 188
3

Another approach with aggregate

df[df$ColB %in% unlist(aggregate(ColB~ColA, df, function(x) head(x, 1))[2]), ]

#    ColA   ColB    ColC
#1     A    Red  7thing
#2     A    Red  OneBot
#4     B  Green  Thing7
#5     B  Green Twosies
#6     B  Green  Square
#11    C Orange    Cube
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
3

As it is a big dataset, approaches using data.table are

library(data.table)
setDT(df1)[df1[, .I[ColB==ColB[1L]], ColA]$V1]
#   ColA   ColB    ColC
#1:    A    Red  7thing
#2:    A    Red  OneBot
#3:    B  Green  Thing7
#4:    B  Green Twosies
#5:    B  Green  Square
#6:    C Orange    Cube

Or with .SD

setDT(df1)[, .SD[ColB==ColB[1L]], ColA]
akrun
  • 874,273
  • 37
  • 540
  • 662
2

You can use dplyr to accomplish what you want:

library(dplyr)
df2 <- df1 %>% group_by(ColA) %>% filter(ColB == first(ColB))

First group_by ColA, then filter to keep only rows where ColB equals the first value in ColB. The results using your data is:

print(df2)
##Source: local data frame [6 x 3]
##Groups: ColA [3]
##
##    ColA   ColB    ColC
##  <fctr> <fctr>  <fctr>
##1      A    Red  7thing
##2      A    Red  OneBot
##3      B  Green  Thing7
##4      B  Green Twosies
##5      B  Green  Square
##6      C Orange    Cube
aichao
  • 7,375
  • 3
  • 16
  • 18