12

I'm trying to extract certain records from a dataframe with grepl.

This is based on a comparison between two columns, Results and Names. The values are built like "WordNumber", but the same word appears with many different numbers (more than 30), so when I use grepl to get, for instance, Word1, I also get results that I would like to avoid, such as Word12.

Any ideas on how to fix this?

Names <- data.frame(name = c("Word1"))  # a data frame, so that Names$name works below
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis") 
Relationships <- data.frame(Results, Records)

Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))

This doesn't work. If I use fixed = TRUE, then it doesn't return any results at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:

Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))

Since I'm concatenating, I'm not sure how to use \b to enforce a full match.

Any suggestions?

Luke Singham
Barbara
  • 1
    I think this is just `Relationships[Relationships$Results==Names,]` - if you end up doing `^Word1$` you're just doing a straight subset. If you have multiple names, then `Relationships[Relationships$Results %in% Names,]` instead. – thelatemail Sep 11 '17 at 11:02
  • @thelatemail It worked perfectly, thanks a lot! Could you post it as an answer? – Barbara Sep 11 '17 at 11:16

3 Answers

18

In addition to @Richard's solution, there are multiple ways to enforce a full match.

\b

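"\b" matches a word boundary: the position between a word character (letter, digit, or underscore) and a non-word character, or the start or end of the string.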

> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1]  TRUE FALSE FALSE

\< & \>

"\<" matches the beginning of a word, and "\>" matches the end of a word.

> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1]  TRUE FALSE FALSE
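
When the names come from a vector rather than a single literal, the same idea can be applied to each element before joining them with "|". The following is a minimal sketch, assuming Names is a plain character vector; the second name "Word15" is added here only for illustration and is not part of the question's data.

```r
# Illustrative data modelled on the question ("Word15" in Names is an assumption)
Names <- c("Word1", "Word15")
Results <- c("Word1", "Word11", "Word12", "Word15")

# Wrap every name in word boundaries, then join the alternatives with "|"
pattern <- paste0("\\b", Names, "\\b", collapse = "|")

# TRUE only where a whole word equals one of the names
hits <- grepl(pattern, Results)
Results[hits]

# Caveat: "." is not a word character, so a boundary exists after "Word1" here
partial <- grepl("\\bWord1\\b", "Word1.5")
```

Note that a word boundary is weaker than a full-string match: a value such as "Word1.5" still counts as a hit, so this approach fits best when the values contain only letters, digits, and underscores.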
parth
  • How would this work in the case of a list? I cannot directly specify the single word to match. – Barbara Sep 11 '17 at 11:19
  • 2
    `Names <-paste0('\\b',Names,'\\b')` or `Names <-paste0('\\<',Names,'\\>')` for lists @Barbara – parth Sep 11 '17 at 11:31
8

Use ^ to match the start of the string and $ to match the end of the string:

Names <-c('^Word1$')

Or, to apply this to the entire Names vector:

Names <-paste0('^',Names,'$')
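
To tie this back to the subsetting in the question, the anchored names can be collapsed into one alternation and passed to grepl. This is a rough sketch, assuming Names is a plain character vector whose entries contain no regex metacharacters; the extra name "Word15" is an illustrative assumption.

```r
# Data modelled on the question; "Word15" in Names is added for illustration
Names <- c("Word1", "Word15")
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)

# Anchor each name at both ends, then join the alternatives with "|"
anchored <- paste0("^", Names, "$")
pattern <- paste(anchored, collapse = "|")

# Keep only the rows whose Results value is exactly one of the names
matched <- Relationships[grepl(pattern, Relationships$Results), ]
matched
```

Because every alternative carries its own ^ and $, "Word11" and "Word12" are excluded even though they start with "Word1".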
Richard
4

I think this is just:

Relationships[Relationships$Results==Names,]

If you end up doing ^Word1$ you're just doing a straight subset. If you have multiple names, then instead use:

Relationships[Relationships$Results %in% Names,]
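
The reason for switching to %in% with several names is that == compares element by element and recycles the shorter vector, which can silently miss matches. A small sketch, using a hypothetical two-element Names vector with the Results values from the question:

```r
Results <- c("Word1", "Word11", "Word12", "Word15")
Names <- c("Word1", "Word12")  # hypothetical: two names instead of one

# == recycles Names, so it compares Results[1] with "Word1",
# Results[2] with "Word12", Results[3] with "Word1", and so on
by_equals <- Results == Names

# %in% tests each Results value for membership in the whole Names vector
by_in <- Results %in% Names

Results[by_equals]
Results[by_in]
```

Here the == version finds only "Word1", because "Word12" happens to be compared against "Word1", while %in% returns both names.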
thelatemail