
Suppose I have a string like

s = "PleaseAddSpacesBetweenTheseWords"

How do I use gsub in R to add a space between the words so that I get

"Please Add Spaces Between These Words"

I should do something like

gsub("[a-z][A-Z]", ???, s)

What do I put for `???`? Also, I find the regular expression documentation for R confusing, so a reference or write-up on regular expressions in R would be much appreciated.

MrFlick
Ben

1 Answer


You just need to capture the matches with parentheses, then use the \1 syntax (written "\\1" inside an R string) to refer to the captured matches. For example:

s = "PleaseAddSpacesBetweenTheseWords"
gsub("([a-z])([A-Z])", "\\1 \\2", s)
# [1] "Please Add Spaces Between These Words"

Of course, this just puts a space between each lower-case/upper-case letter pair. It doesn't know what a real "word" is.
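As a rough illustration of both points (how the numbered groups map to the parentheses, and what the pattern misses), here is a small sketch. The input strings `"aB"` and `"ThisIsATest"` are made-up examples, not from the question, and the results assume a locale where `[a-z]` and `[A-Z]` cover only the plain ASCII letters.

```r
# Swapping the back-references shows which group is which:
# \\1 is the lower-case letter, \\2 is the upper-case letter.
swapped <- gsub("([a-z])([A-Z])", "\\2\\1", "aB")
swapped
# [1] "Ba"

# Only a lower-case letter followed by an upper-case letter is split,
# so a run of capitals such as "AT" stays together.
camel <- "ThisIsATest"  # hypothetical input
spaced <- gsub("([a-z])([A-Z])", "\\1 \\2", camel)
spaced
# [1] "This Is ATest"
```

If single-letter words or acronyms matter for your data, the pattern would need to be adjusted accordingly.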

MrFlick
  • Thanks. Does \\1 refer to the first letter in the match, \\2 the second, etc? And why are the brackets necessary? – Ben Nov 12 '14 at 21:35
  • \\1 refers to the stuff that matches the inside of the first set of parentheses, and similarly \\2 refers to the second set of parentheses. In this case, both parens surround regular expressions that match a single character. `[a-z]` means match any character from a to z (lowercase) one time. – blakeoft Nov 12 '14 at 21:37
  • @Ben Google around for "regex cheatsheet" . That'll give you a lot of useful info. – Carl Witthoft Nov 12 '14 at 21:39
  • `"([[:alpha:]])([[:upper:]])"` might be better, since it's less locale-specific and will split out single-letter words (assuming again that each word is capitalized only at its beginning). – Josh O'Brien Nov 12 '14 at 21:39
  • I assumed you knew about the brackets since they were in your original regular expression. Those define the character classes: the first being the lower-case letters, and the second the upper-case letters. If you were referring to the parentheses, those tell the regular expression engine to remember what part of the string matched each particular expression. It would not work without the parentheses; you would not match each letter separately. – MrFlick Nov 12 '14 at 21:40
  • Thank you. I meant parenthesis :) I shouldn't have been so sloppy. @Carl, thanks for the tip! – Ben Nov 12 '14 at 21:47
  • If the letters may come from different languages, I would consider `gsub('\\p{Ll}\\K(?=\\p{Lu}+)', ' ', s, perl=T)`. – hwnd Nov 12 '14 at 22:06
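For comparison, the two alternative patterns suggested in the comments can be tried side by side. This is only a sketch: `"IAmHere"` is a made-up input chosen to show the single-letter-word case, and the variable names are illustrative.

```r
s <- "PleaseAddSpacesBetweenTheseWords"

# The [a-z][A-Z] pattern from the answer leaves a leading single-letter word attached.
ascii_split <- gsub("([a-z])([A-Z])", "\\1 \\2", "IAmHere")
ascii_split
# [1] "IAm Here"

# POSIX classes: any letter followed by an upper-case letter.
posix_split <- gsub("([[:alpha:]])([[:upper:]])", "\\1 \\2", "IAmHere")
posix_split
# [1] "I Am Here"

# PCRE Unicode properties (this needs perl = TRUE): \\K drops the lower-case
# letter from the match, and the lookahead does not consume the upper-case
# letter, so only a space is inserted between them.
pcre_split <- gsub("\\p{Ll}\\K(?=\\p{Lu}+)", " ", s, perl = TRUE)
pcre_split
# [1] "Please Add Spaces Between These Words"
```

Because the capture-group patterns consume both letters of each match, adjacent candidate pairs can be skipped (for example, a single-letter word in the middle of a string), so it is worth checking the output on your own data.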