
I searched for a solution, but there does not seem to be a clear one for R.
I am trying to split a string on a pattern of, say, a space followed by a capital letter, and I am using the stringr package for that.

library(stringr)
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")

This returns:

[[1]]
[1] "Foobar foobar," "oobar foobar"  

The output I would like to get, however, should include the letter from the delimiter:

[[1]]
[1] "Foobar foobar," "Foobar foobar"

There is probably no out-of-the-box solution in stringr, such as back-referencing, so I would appreciate any help.

perechen

2 Answers


You may split on one or more whitespace characters that are followed by an uppercase letter:

> str_split(x, "\\s+(?=[[:upper:]])")
[[1]]
[1] "Foobar foobar," "Foobar foobar" 

Here,

  • \\s+ - one or more whitespace characters
  • (?=[[:upper:]]) - a positive lookahead (a non-consuming pattern) that only checks for an uppercase letter immediately to the right of the current location in the string without adding it to the match value, thus preserving it in the output.

Note that \s matches various whitespace characters, not just plain spaces. Also, if you plan to use the pattern with other regex engines (PCRE, for example), it is safer to write [[:upper:]] rather than [:upper:].
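Since the note above mentions PCRE, here is a minimal sketch of the same split in base R, assuming you would rather not depend on stringr. With `perl = TRUE`, `strsplit()` uses the PCRE engine, which supports lookaheads. The variable name `parts` is only illustrative.

```r
# Base R sketch: strsplit() with perl = TRUE uses PCRE, which supports lookaheads.
x <- "Foobar foobar, Foobar foobar"

# Split on whitespace that is followed by an uppercase letter.
# The lookahead only checks for the letter and does not consume it,
# so the letter stays at the start of the next piece.
parts <- strsplit(x, "\\s+(?=[[:upper:]])", perl = TRUE)[[1]]
print(parts)
# [1] "Foobar foobar," "Foobar foobar"
```

The bracketed form `[[:upper:]]` is used here because PCRE only recognises POSIX classes inside a bracket expression.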

Wiktor Stribiżew
  • Thanks a lot! This solves it easily, never used this regex trick. – perechen Jun 01 '18 at 20:10
  • @perechen Note that if you need to check for a comma before the whitespace chars, you may use akrun's suggestion to use a lookbehind, too - `(?<=,)`: `"(?<=,)\\s+(?=[[:upper:]])"`. This pattern will match 1+ whitespaces in between a comma and an uppercase letter. – Wiktor Stribiżew Jun 01 '18 at 20:13

We could use regex lookarounds to split at the space between a comma and an uppercase character:

str_split(x, "(?<=,) (?=[A-Z])")[[1]]
#[1] "Foobar foobar," "Foobar foobar" 
akrun
  • Thank you, this is a good one. I would actually like to check for punctuation in the real-world task. – perechen Jun 01 '18 at 20:19
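
For the punctuation case mentioned in the last comment, one possible sketch is to swap the comma in the lookbehind for the `[[:punct:]]` class. The sample string `x2` is made up for illustration and does not come from the original question.

```r
library(stringr)

# Hypothetical sample input with several punctuation marks before capitals.
x2 <- "Foo bar. Baz qux! Quux corge, Grault"

# Split on whitespace that sits between any punctuation character
# (lookbehind) and an uppercase letter (lookahead). Neither lookaround
# consumes characters, so both stay in the output pieces.
parts2 <- str_split(x2, "(?<=[[:punct:]])\\s+(?=[[:upper:]])")[[1]]
print(parts2)
# [1] "Foo bar."    "Baz qux!"    "Quux corge," "Grault"
```

Which characters count as punctuation depends on the regex engine, so it is worth testing the class against the marks that actually occur in your data.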