
I am trying to write a function that returns all the substrings of a string that match a regular expression. For example:

str <- "hello Brother How are you"

I want to extract all the substrings of str that match the regular expression "[A-z]+ [A-z]+", including overlapping ones,

which should give:

"hello Brother"
"Brother How"
"How are"
"are you"

Is there a library function that can do this?

mtoto
Partha Roy

2 Answers


You can do it with the str_match_all function from the stringr library, using the method Tim Pietzcker describes in his answer (a capturing group inside an unanchored positive lookahead):

> library(stringr)
> str <- "hello Brother How are you"
> res <- str_match_all(str, "(?=\\b([[:alpha:]]+ [[:alpha:]]+))")
> l <- unlist(res)
> l[l != ""]
## [1] "hello Brother" "Brother How"   "How are"       "are you"

Or, to get only the unique values:

> unique(l[l != ""])
##[1] "hello Brother" "Brother How"   "How are"       "are you"      

I recommend using [[:alpha:]] instead of [A-z], because [A-z] matches more than just letters: the range also covers the ASCII characters that sit between Z and a ([, \, ], ^, _ and `).
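To see the difference between the two character classes, here is a small base R check. It is only a sketch: the variable names (extras, in_range, is_alpha) are made up for illustration, and perl = TRUE is used so that the range is interpreted by code point rather than by locale.

```r
# The six ASCII characters that sit between "Z" and "a"
extras <- c("[", "\\", "]", "^", "_", "`")

# [A-z] spans code points 0x41 to 0x7A, so it accepts all of them
in_range <- grepl("^[A-z]$", extras, perl = TRUE)

# [[:alpha:]] accepts letters only, so it rejects all of them
is_alpha <- grepl("^[[:alpha:]]$", extras, perl = TRUE)

print(in_range)
print(is_alpha)
```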

Wiktor Stribiżew
  • Thanks @wiktor, it worked! I was looking for this kind of function. – Partha Roy Jan 28 '16 at 12:04
  • You can actually access those strings without getting unique values, but you will still get empty values in the first column after running `str_match_all` (I tried to show a way to get only the non-empty values). Note that `str_extract_all` returns only the matched text, which is always empty for a lookahead, so you cannot use it here. – Wiktor Stribiżew Jan 28 '16 at 12:07
  • Yes @wiktor, I wanted to know the method for extracting the matches, and I have modified your code to suit my needs. Thank you ;) – Partha Roy Jan 28 '16 at 13:32

Regex matches "consume" the text they match, therefore (generally) the same bit of text can't match twice. But there are constructs called lookaround assertions which don't consume the text they match, and which may contain capturing groups.

That makes your endeavor possible (although you shouldn't use [A-z]; it doesn't do what you think it does, because the range also includes the punctuation characters between Z and a):

(?=\b([A-Za-z]+ [A-Za-z]+))

will match as expected; you need to look at group 1 of the match result, not the matched text itself (which will always be empty).

The \b word boundary anchor is necessary to ensure that our matches always start at the beginning of a word (otherwise you'd also have the results "ello Brother", "llo Brother", "lo Brother", and "o Brother").
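The points above can be tried out in base R without any extra packages. The following is a sketch rather than a definitive implementation: group1 is a hypothetical helper written for this illustration, and it assumes a single input string with at least one match. It reads the capture.start and capture.length attributes that gregexpr returns when perl = TRUE.

```r
str <- "hello Brother How are you"

# Ordinary matching consumes text, so the pairs cannot overlap
consumed <- regmatches(str, gregexpr("[A-Za-z]+ [A-Za-z]+", str, perl = TRUE))[[1]]

# Hypothetical helper: return capture group 1 of every match,
# including zero-width lookahead matches
group1 <- function(pattern, x) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  start <- as.integer(attr(m, "capture.start")[, 1])
  len <- as.integer(attr(m, "capture.length")[, 1])
  substring(x, start, start + len - 1)
}

# Lookahead with \b: one overlapping pair per word start
overlapping <- group1("(?=\\b([A-Za-z]+ [A-Za-z]+))", str)

# Lookahead without \b: a match also starts inside every word
unanchored <- group1("(?=([A-Za-z]+ [A-Za-z]+))", str)

print(consumed)
print(overlapping)
print(head(unanchored, 5))
```

Comparing consumed with overlapping shows what the lookahead adds, and the head of unanchored shows the partial-word results that \b filters out.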

Test it live on regex101.com.

Tim Pietzcker
  • Thanks @tim. I am able to extract the first match, but I want to extract every possible match. – Partha Roy Jan 28 '16 at 11:43