Using regexp to select rows in R dataframe

Question

I'm trying to select rows in a dataframe where the string contained in a column matches either a regular expression or a substring:

dataframe:

aName   bName   pName   call  alleles   logRatio    strength
AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801

For example, I want a dataframe containing only rows that contain ADN in column bName. Secondarily, I would like all rows that contain ADN in column bName and that match 2011-02-10_R2 in column pName.

I tried using functions grep(), agrep() and more but without success...

score 38 · Accepted Answer · edited Feb 25 '23 at 09:24

subset(dat, grepl("ADN", bName)  &  pName == "2011-02-10_R2" )

Note & (and not && which is not vectorized) and that == (and not = which is assignment).

Note that you could have used:

 dat[ with(dat,  grepl("ADN", bName)  &  pName == "2011-02-10_R2" ) , ]

... and that might be preferable when used inside functions, however, that will return NA values for any lines where dat$pName is NA. That defect (which some regard as a feature) could be removed by the addition of & !is.na(dat$pName) to the logical expression.

score 9 · Answer 2 · answered Mar 01 '12 at 17:31

Here you go.

First recreate your data:

dat <- read.table(text="
aName   bName   pName   call  alleles   logRatio    strength
AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801
", header=TRUE)

Next, use grepl to construct a logical index of matches:

index1 <- with(dat, grepl("ADN", bName))
index2 <- with(dat, grepl("2011-02-10_R2", pName))

Now subset using the & operator:

dat[index1 & index2, ]
        aName      bName         pName call alleles  logRatio strength
7 AX-11086564 D04_ADN103 2011-02-10_R2   BB      GG -1.898088 9.872966

DrDom · Answer 3 · 2012-03-01T17:51:04.547

3

Corrected according Andrie advice. I hope this should work. :)

df[grepl("ADN", df$bName),]
df[grepl("ADN", df$bName) & df$pName == "2011-02-10_R2",]

edited Mar 01 '12 at 17:51

answered Mar 01 '12 at 17:36

DrDom

4,033
1
21
23

Both of these statements returns an empty dataframe, not the desired rows 1 and 7. – Andrie Mar 01 '12 at 17:42

score 0 · Answer 4 · answered Apr 01 '17 at 17:01

This is a pretty minimal solution using dplyr and magrittr which I think is what you are after:

Data:
library(magrittr)
library(stringr)
dat <- read.table(text="
aName   bName   pName   call  alleles   logRatio    strength
                  AX-11086564 F08_ADN103  2011-02-10_R10  AB  CG  0.363371    10.184215
                  AX-11086564 A01_CD1919  2011-02-24_R11  BB  GG  -1.352707   9.54909
                  AX-11086564 B05_CD2920  2011-01-27_R6   AB  CG  -0.183802   9.766334
                  AX-11086564 D04_CD5950  2011-02-09_R9   AB  CG  0.162586    10.165051
                  AX-11086564 D07_CD6025  2011-02-10_R10  AB  CG  -0.397097   9.940238
                  AX-11086564 B05_CD3630  2011-02-02_R7   AA  CC  2.349906    9.153076
                  AX-11086564 D04_ADN103  2011-02-10_R2   BB  GG  -1.898088   9.872966
                  AX-11086564 A01_CD2588  2011-01-27_R5   BB  GG  -1.208094   9.239801
                  ", header=TRUE)

rows that contain ADN in column bName.

dat %>%
  filter(str_detect(bName, "ADN") == TRUE)

Secondarily, I would like all rows that contain ADN in column bName and that match 2011-02-10_R2 in column pName.

dat %>%
  filter(str_detect(bName, "ADN") & pName == "2011-02-10_R2")

You don't need to test against TRUE; `str_detect(bName, "ADN")` is enough — camille, Jan 23 '20 at 13:52

Darryl · Answer 5 · 2012-03-01T18:09:08.657

I tested using Expresso and used .Net-style regexes; you may have to tweak for your regex flavor. I also left whitespace in for readability; remove or use a regex option flag to ignore.

The basic regex to capture all lines is:

(?<aName> [\w-]+ ) \s+ (?<bName> [\w_]+ ) \s+ (?<pName> [\w-_]+ ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ )

From this, you just need to tweak the regex for the appropriate named capture group(s) to extract only the lines that you want. The modified version to capture using the criteria you gave (bName contains "ADN" and pName = "2011-02-10_R2") is:

(?<aName> [\w-]+ ) \s+ (?<bName> [\w_]*ADN[\w_]* ) \s+ (?<pName> 2011-02-10_R2 ) \s+ (?<call> \w+ ) \s+ (?<alleles> \w+ ) \s+ (?<logRatio> [\d\.-]+ ) \s+ (?<strength> [\d\.-]+ )

score -1 · Answer 6 · answered Aug 11 '20 at 03:21

-1

The same logic as above

df %>% 
  filter(grepl("ADN", bName) & grepl("2011-02-10_R2", pName))
#     aName      bName         pName call alleles  logRatio     strength
# 1 AX-11086564 D04_ADN103 2011-02-10_R2   BB      GG -1.898088 9.872966

answered Aug 11 '20 at 03:21

Tho Vu

1,304
2
8
20

Bjørne Malmanger · Answer 7 · 2012-03-01T17:42:59.650

-3

Why not just:

grep 'ADN'|grep '2011-02-10_R2'

You could also do this:

grep -P '\t.{4}(ADN).*(2011-02-10_R2).*'

edited Mar 01 '12 at 17:42

answered Mar 01 '12 at 17:30

Bjørne Malmanger

1,457
10
11

10

Because the language is [tag:R] – Andrie Mar 01 '12 at 17:31
1

Sorry, didn't notice the 'r'. I guess you just could copy my regexp then – Bjørne Malmanger Mar 01 '12 at 17:44

Using regexp to select rows in R dataframe

7 Answers7

Linked