The original title of this question was: R Regex for word boundary excluding space. It reflected the way I was approaching the problem. However, this is a better solution to my particular problem. It should work as long as a particular delimiter is used to separate items within a 'cell'.

This must be very simple, but I've hit a brick wall on it. I have a data frame column where each cell (row) is a comma-separated list of items. I want to find the rows that contain a specific item.

df <- data.frame(nms = c("XXXCAP,XXX CAPITAL LIMITED", "XXX,XXX POLYMERS LIMITED, 3455", "YYY,XXX REP LIMITED,999,XXX"),
                 b = c("A", "X", "T"))
                             nms b
1     XXXCAP,XXX CAPITAL LIMITED A
2 XXX,XXX POLYMERS LIMITED, 3455 X
3    YYY,XXX REP LIMITED,999,XXX T

I want to search for rows that have item XXX. Rows 2 and 3 should match. Row 1 has the string XXX as part of a larger string and obviously should not match.

However, because the second XXX in row 1 has a comma on one side and a space on the other, I am having trouble filtering it out with \\b or [[:<:]]

grep("\\bXXX\\b", df$nms, value = FALSE) # matches 1,2,3

The easiest way to do this is, of course, strsplit(), but I'd like to avoid it. Any suggestions on performance are welcome.

R.S.
  • When `\b` does not "work", the problem lies in the definition of the "whole word". Please add the details about why the first string does not contain an `XXX` "whole word" (it seems you want to only match a word in between commas or start/end of the string). – Wiktor Stribiżew Jul 22 '18 at 19:07

2 Answers

When \b does not "work", the problem usually lies in the definition of the "whole word".

A word boundary can occur in one of three positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.
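As a rough illustration of the three cases above, the sketch below runs the question's sample strings through `\\bXXX\\b`. A comma and a space are both non-word characters, so each of them forms a boundary next to `X`. The variable names `word_hits` and `m` are only illustrative.

```r
# Sample strings copied from the question
nms <- c("XXXCAP,XXX CAPITAL LIMITED",
         "XXX,XXX POLYMERS LIMITED, 3455",
         "YYY,XXX REP LIMITED,999,XXX")

# \b matches between a word character and a non-word character,
# so ",XXX " counts as a whole word just like ",XXX," does.
word_hits <- grepl("\\bXXX\\b", nms, perl = TRUE)
print(word_hits)

# Start position(s) of whole-word XXX in row 1: the leading "XXXCAP"
# is skipped (no boundary between X and C), the XXX after the comma is found.
m <- gregexpr("\\bXXX\\b", nms[1], perl = TRUE)[[1]]
print(as.integer(m))
```

All three rows are reported as matches, which is why `\b` alone cannot separate row 1 from the others.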

It seems you want to match a word only when it sits between commas or at the start/end of the string.

You may use a PCRE regex (note the perl=TRUE argument) like

(?<![^,])XXX(?![^,])

See the regex demo (the expression is "converted" to use positive lookarounds due to the fact it is a demo with a single multiline string).

Details

  • (?<![^,]) (equal to (?<=^|,)) - either start of the string or a comma
  • XXX - an XXX word
  • (?![^,]) (equal to (?=$|,)) - either end of the string or a comma
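A small sketch of the equivalence claimed in the details above, using the question's sample strings. The multi-line string `one` is a made-up example (not from the question) meant to mimic an online tester that sees all rows as a single string.

```r
# Sample strings copied from the question
nms <- c("XXXCAP,XXX CAPITAL LIMITED",
         "XXX,XXX POLYMERS LIMITED, 3455",
         "YYY,XXX REP LIMITED,999,XXX")

neg <- "(?<![^,])XXX(?![^,])"  # negative lookarounds with a negated class
pos <- "(?<=^|,)XXX(?=$|,)"    # positive lookarounds with alternation

# On separate strings both forms select the same rows
neg_hits <- grepl(neg, nms, perl = TRUE)
pos_hits <- grepl(pos, nms, perl = TRUE)
print(neg_hits)
print(pos_hits)

# Hypothetical single string with line breaks, as an online tester would see it.
# The newline before XXX is "a char other than a comma", so the negative
# lookbehind fails; the positive form in multiline mode treats it as a line start.
one <- "AAA\nXXX\nBBB"
multi_neg <- grepl(neg, one, perl = TRUE)
multi_pos <- grepl(paste0("(?m)", pos), one, perl = TRUE)
print(c(multi_neg, multi_pos))
```

The two forms agree on a vector of separate strings and differ only when one string contains line breaks.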

R demo:

> grep("(?<![^,])XXX(?![^,])",df$nms, value = FALSE, perl=TRUE)
## => [1] 2 3

The equivalent TRE regex will look like

> grep("(?:^|,)XXX(?:$|,)",df$nms, value = FALSE)

Note that here, non-capturing groups are used to match either start of string or , (see (?:^|,)) and either end of string or , (see (?:$|,)).
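One possible way to reuse the TRE pattern for other items is to wrap it in a small helper. `has_item` is a hypothetical name, not part of any package, and the sketch assumes the item contains no regex metacharacters. The `strsplit()` call is there only as a cross-check of the result, not as the recommended approach.

```r
# Sample strings copied from the question
nms <- c("XXXCAP,XXX CAPITAL LIMITED",
         "XXX,XXX POLYMERS LIMITED, 3455",
         "YYY,XXX REP LIMITED,999,XXX")

# Hypothetical helper: TRUE where `item` is a complete comma-delimited entry.
# Assumption: `item` has no regex metacharacters (no escaping is done here).
has_item <- function(x, item) {
  pattern <- paste0("(?:^|,)", item, "(?:$|,)")
  grepl(pattern, x)  # default (TRE) engine, no perl = TRUE needed
}

regex_hits <- has_item(nms, "XXX")
print(regex_hits)

# Cross-check: split on commas and look for an exact element
split_hits <- vapply(strsplit(nms, ",", fixed = TRUE),
                     function(items) "XXX" %in% items,
                     logical(1))
print(identical(regex_hits, split_hits))
```

Items that may contain characters such as `.` or `+` would need escaping before being pasted into the pattern.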

Wiktor Stribiżew
  • Many thanks. This also seems to match `XXX REP LIMITED`, and I am trying to figure out why. Please take a look: https://regex101.com/r/mLoCRn/2 – R.S. Jul 22 '18 at 17:44
  • @R.S. Your question sounds as if you wanted to match a string that contains a whole word and no substring in a longer word. I see from the comment to the below question you want to match a string that contains a value only in between commas or start/end of string. You may also use `(?<=,|^)XXX(?=,|$)` or `(?<![^,])XXX(?![^,])` – Wiktor Stribiżew Jul 22 '18 at 18:58
  • @R.S. Please update the question with the real requirements so as to make it useful for future readers and make our answers more relevant (you might raise more upvotes if your question is clear). – Wiktor Stribiżew Jul 22 '18 at 19:04
  • Thanks. Maybe I should have added another row to the dataframe as I have done at regex101 link you provided. I thought my sentence that said #1 was ineligible because there "XXX was a part of a larger string" covered it . Will update. – R.S. Jul 22 '18 at 19:11
  • @R.S. Well, that is why word boundaries came to mind, but `\b` in regex is a very specific construct, see my latest edit where I add the information about what `\b` actually matches. – Wiktor Stribiżew Jul 22 '18 at 19:17
  • I hope you are still following this thread and have the patience for me. I guess using a TRE non-capturing group as you have suggested will have some performance advantage over a simple `"(^|,)XXX(,|$)"`, and that's what I am going to use. However, the first one, `"(?<![^,])XXX(?![^,])"`, confuses me. Does it mean that when using lookbehinds, a `^` inside `[^,]` will not act as a negation? regex101 seems to have trouble with that, though it's working fine in R. – R.S. Jul 23 '18 at 10:53
  • @R.S. At regex101, you test a regex against a *single* string with line breaks. In real life, you want to run the regex against a set of separate strings. `(?<![^,])` matches a location that is not immediately preceded with a char other than a comma. That means, it matches a location that is preceded with a comma or start of a *string*, not a *line*. Same with the lookahead. Do not use online testers blindly, they may cheat you. And I would recommend against using `(?<=^|,)` due to the alternation inside, it is one of the most resource consuming regex features. – Wiktor Stribiżew Jul 23 '18 at 22:24
  • Your explanation makes the logic painfully obvious! Pity I missed it. Truly appreciated. – R.S. Jul 24 '18 at 07:05
  • @R.S. Just [another example](https://regex101.com/r/qkLaMG/1) why you should be cautious about regex online testers: although I chose JavaScript regex flavor, the substitution is allowed to contain operators that are not really supported in JS `replace` method. And it is not the only bug there. – Wiktor Stribiżew Jul 24 '18 at 07:09
  • Oh. Even then, online regex editors are an indispensable tool. And this one has probably the best interface . – R.S. Jul 24 '18 at 07:30
  • @R.S. Yes, I agree. I just wanted to show that because there are a lot of people who only answer regex questions with "Try-this" and a link to regex101. It is not always a proof the regex will actually work in the target environment. – Wiktor Stribiżew Jul 24 '18 at 07:34

This is perhaps a somewhat simplistic solution, but it works for the examples which you've provided:

library(stringr)

df$nms %>%
  str_replace_all('\\s', '') %>% # Removes all spaces, tabs, newlines, etc
  str_detect('(^|,)XXX(,|$)')    # Detects string XXX surrounded by comma or beginning/end

[1] FALSE  TRUE  TRUE

Also, have a look at this cheatsheet made by RStudio on Regular Expressions - it is very nicely made and very useful (I keep going back to it when I'm in doubt).

Vlad C.
  • In this case "XX X" will be treated similar to "XXX" `"AAA,XX X,YY"%>% str_replace_all('\\s', '') %>% str_detect('(^|,)XXX(,|$)') ` – R.S. Jul 22 '18 at 18:11
  • However, it seems your answer has pointed me in the right direction. I did not know I could `or` an `anchor ^ $` like this. So a simple `grep("(^|,)XXX(,|$)",df$nms, value = FALSE, perl=TRUE)` works – R.S. Jul 22 '18 at 18:28
  • Thanks for the suggestions. – R.S. Jul 24 '18 at 07:07