25

I'm trying to select rows in a dataframe where the string contained in a column matches either a regular expression or a substring:

dataframe:

aName   bName   pName   call  alleles   logRatio    strength
AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801

For example, I want a dataframe containing only rows that contain ADN in column bName. Secondarily, I would like all rows that contain ADN in column bName and that match 2011-02-10_R2 in column pName.

I tried using functions grep(), agrep() and more but without success...

Stedy
  • 7,359
  • 14
  • 57
  • 77
Eric C.
  • 3,310
  • 2
  • 22
  • 29

7 Answers7

38
subset(dat, grepl("ADN", bName)  &  pName == "2011-02-10_R2" )

Note & (and not && which is not vectorized) and that == (and not = which is assignment).

Note that you could have used:

 dat[ with(dat,  grepl("ADN", bName)  &  pName == "2011-02-10_R2" ) , ]

... and that might be preferable when used inside functions, however, that will return NA values for any lines where dat$pName is NA. That defect (which some regard as a feature) could be removed by the addition of & !is.na(dat$pName) to the logical expression.

bathyscapher
  • 1,615
  • 1
  • 13
  • 18
IRTFM
  • 258,963
  • 21
  • 364
  • 487
9

Here you go.

First recreate your data:

dat <- read.table(text="
aName   bName   pName   call  alleles   logRatio    strength
AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801
", header=TRUE)

Next, use grepl to construct a logical index of matches:

index1 <- with(dat, grepl("ADN", bName))
index2 <- with(dat, grepl("2011-02-10_R2", pName))

Now subset using the & operator:

dat[index1 & index2, ]
        aName      bName         pName call alleles  logRatio strength
7 AX-11086564 D04_ADN103 2011-02-10_R2   BB      GG -1.898088 9.872966
Andrie
  • 176,377
  • 47
  • 447
  • 496
3

Corrected according Andrie advice. I hope this should work. :)

df[grepl("ADN", df$bName),]
df[grepl("ADN", df$bName) & df$pName == "2011-02-10_R2",]
DrDom
  • 4,033
  • 1
  • 21
  • 23
0

This is a pretty minimal solution using dplyr and magrittr which I think is what you are after:

Data:
library(magrittr)
library(stringr)
dat <- read.table(text="
aName   bName   pName   call  alleles   logRatio    strength
                  AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
                  AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
                  AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
                  AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
                  AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
                  AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
                  AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
                  AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801
                  ", header=TRUE)

rows that contain ADN in column bName.

dat %>%
  filter(str_detect(bName, "ADN") == TRUE)

Secondarily, I would like all rows that contain ADN in column bName and that match 2011-02-10_R2 in column pName.

dat %>%
  filter(str_detect(bName, "ADN") & pName == "2011-02-10_R2") 
petergensler
  • 342
  • 2
  • 8
  • 23
0

I tested using Expresso and used .Net-style regexes; you may have to tweak for your regex flavor. I also left whitespace in for readability; remove or use a regex option flag to ignore.

The basic regex to capture all lines is:

(?<aName> [\w-]+ ) \s+ (?<bName> [\w_]+ ) \s+ (?<pName> [\w-_]+ ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ ) 

From this, you just need to tweak the regex for the appropriate named capture group(s) to extract only the lines that you want. The modified version to capture using the criteria you gave (bName contains "ADN" and pName = "2011-02-10_R2") is:

(?<aName> [\w-]+ ) \s+ (?<bName> [\w_]*ADN[\w_]* ) \s+ (?<pName> 2011-02-10_R2 ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ ) 
Darryl
  • 1,531
  • 15
  • 26
-1

The same logic as above

df %>% 
  filter(grepl("ADN", bName) & grepl("2011-02-10_R2", pName))
#     aName      bName         pName call alleles  logRatio     strength
# 1 AX-11086564 D04_ADN103 2011-02-10_R2   BB      GG -1.898088 9.872966
Tho Vu
  • 1,304
  • 2
  • 8
  • 20
-3

Why not just:

grep 'ADN'|grep '2011-02-10_R2'

You could also do this:

grep -P '\t.{4}(ADN).*(2011-02-10_R2).*'
Bjørne Malmanger
  • 1,457
  • 10
  • 11