14

I'm working on a choropleth in R and need to be able to match state names with match.map(). The dataset I'm using sticks multi-word names together, like NorthDakota and DistrictOfColumbia.

How can I use regular expressions to insert a space between lower-upper letter sequences? I've successfully added a space but haven't been able to preserve the letters that indicate where the space goes.

places = c("NorthDakota", "DistrictOfColumbia")
gsub("[[:lower:]][[:upper:]]", " ", places)
[1] "Nort akota"       "Distric  olumbia"
Ben Bolker
Nancy

2 Answers

16

Use parentheses to capture the matched expressions, then use backreferences (\1, \2, and so on, written as \\1, \\2 inside an R string) to retrieve them in the replacement:

places = c("NorthDakota", "DistrictOfColumbia")
gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", places)
## [1] "North Dakota"         "District Of Columbia"
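As a quick check that the substitution is applied at every lower-to-upper boundary and not only the first one, here is a minimal sketch that wraps the same call in a function. The name `split_camel` is a hypothetical helper chosen for illustration, not part of base R.

```r
# split_camel is a hypothetical helper name, not a base R function.
split_camel <- function(x) {
  # \\1 is the captured lowercase letter, \\2 the uppercase letter after it;
  # the replacement puts both back with a space between them.
  gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", x)
}

words <- c("NorthDakota", "DistrictOfColumbia", "SomethingWithMoreThanTwoWords")
result <- split_camel(words)
print(result)
```

Because gsub() replaces all non-overlapping matches, a name with any number of joined words should be handled by the same pattern.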
Ben Bolker
  • Awesome. What if, for example, there was an unknown n? Say some formatting got messed up in transit and one had to do this for thousands of words? – Nancy Jul 14 '14 at 15:49
  • Can you give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (not with thousands of words, obviously, but with some reasonable number such as three or four) of what you mean? e.g. is `"SomethingWithMoreThanTwoWords"` (which BTW works fine with the above incantation) an appropriate test case? – Ben Bolker Jul 14 '14 at 15:50
  • Nevermind. I had misinterpreted what the \\1 and \\2 were doing but now I see. – Nancy Jul 14 '14 at 16:01
11

You want to use capturing groups to capture the matched context so that you can refer back to each matched group in the replacement. To access a group, write two backslashes (\\) followed by the group number.

> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('([[:lower:]])([[:upper:]])', '\\1 \\2', places)
# [1] "North Dakota"         "District Of Columbia"

Another way is to switch on PCRE with perl=TRUE and use \K together with a lookahead assertion.

> places = c('NorthDakota', 'DistrictOfColumbia')
> gsub('[a-z]\\K(?=[A-Z])', ' ', places, perl=TRUE)
# [1] "North Dakota"         "District Of Columbia"

Explanation:

The \K escape sequence resets the starting point of the reported match, so any previously consumed characters are no longer included in it. In effect, it throws away everything matched up to that point, which leaves only the empty position before the uppercase letter to be replaced with a space.

[a-z]       # any character of: 'a' to 'z'
\K          # '\K' (resets the starting point of the reported match)
(?=         # look ahead to see if there is:
  [A-Z]     #   any character of: 'A' to 'Z'
)           # end of look-ahead
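To see the effect of \K in isolation, the following sketch compares what regexpr() reports as the match with and without it. It assumes perl = TRUE is available as in the call above. The variable names `plain` and `reset` are only illustrative.

```r
x <- "NorthDakota"

# Without \K, the reported match includes the lowercase letter.
plain <- regmatches(x, regexpr("[a-z][A-Z]", x, perl = TRUE))

# With \K, the start of the reported match is reset after the lowercase
# letter, so only the uppercase letter is reported.
reset <- regmatches(x, regexpr("[a-z]\\K[A-Z]", x, perl = TRUE))

print(plain)  # "hD"
print(reset)  # "D"
```

In the gsub() call above, the uppercase letter sits inside a lookahead rather than in the match itself, so the reported match is empty and the space is inserted without removing any letters.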
hwnd