
Given a string such as this:

x <- c("Carroll 103 215 Albany City 24 41 Allegany 115 231 Charlotte 116 248")

What's the best way to split this into lines such as this:

# [1] Carroll 103 215
# [2] Albany City 24 41
# [3] Allegany 115 231
# [4] Charlotte 116 248

It's the "Albany City" that is giving me trouble. Other names also contain more than one word (e.g. "Port Jervis City"); however, every name is always followed by one or more numeric values.

JasonAizkalns

2 Answers


You can use str_extract_all (from the stringr package), which returns all matches of a regex in a string, with this pattern:

[A-Za-z ]+(\s\d+)+\s?


Explanation:

  • [A-Za-z ]+ matches one or more words separated by spaces
  • (\s\d+)+ matches one or more numbers, each preceded by a whitespace character
  • \s? matches an optional trailing whitespace character
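As a rough sketch of how this pattern behaves on the string from the question, the snippet below uses base R's gregexpr and regmatches in place of stringr's str_extract_all (a substitution made here so the example needs no packages), then trimws to drop the optional trailing whitespace that \s? captures. The variable names are illustrative only.

```r
# Input string from the question
x <- "Carroll 103 215 Albany City 24 41 Allegany 115 231 Charlotte 116 248"

# Pattern from the answer; backslashes are doubled inside an R string literal
pattern <- "[A-Za-z ]+(\\s\\d+)+\\s?"

# gregexpr locates every match, regmatches extracts the matched text
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# The optional \s? leaves a trailing space on all but the last match
result <- trimws(matches)
print(result)
```

With stringr installed, str_extract_all(x, pattern)[[1]] should be usable in place of the regmatches line.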
mrzasa
  • This works but returns a leading space (which I can certainly handle after the fact) for all results except the first one – JasonAizkalns Nov 15 '18 at 14:30
  • Some grouping may help: `([A-Za-z ]+\s\d+\s\d+)\s?`. Details are shown here: https://stackoverflow.com/questions/952275/regex-group-capture-in-r-with-multiple-capture-groups – mrzasa Nov 15 '18 at 14:34
  • This solution works only if each name is followed by exactly two numbers. – nicola Nov 15 '18 at 14:36
  • You can add `\w` to the start of the pattern to exclude the trailing space: https://regex101.com/r/4MlO3q/2 – doom87er Nov 15 '18 at 14:46
  • @nicola: extended to support any number of numbers – mrzasa Nov 15 '18 at 14:50
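The points raised in the comments (stray spaces around matches, and names followed by a varying count of numbers) can be illustrated with one pattern. This is a sketch only: the pattern below is a variation written for this example, not one posted in the thread, and the input string is made up. It starts each match on a letter and ends it on a digit, so no trimming is needed.

```r
# Hypothetical input: names of one to three words, each followed by
# one or more numbers (not the data from the question)
y <- "Port Jervis City 7 Albany City 24 41 3 Carroll 103 215"

# Assumed variation of the answer's pattern:
#   [A-Za-z]+          first word of the name
#   (\s[A-Za-z]+)*     any further words of the name
#   (\s\d+)+           one or more numbers
pattern <- "[A-Za-z]+(\\s[A-Za-z]+)*(\\s\\d+)+"

records <- regmatches(y, gregexpr(pattern, y, perl = TRUE))[[1]]
print(records)
```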

You can use strsplit from base R (see ?strsplit):

strsplit(x, "(?<=\\d)\\s(?=[A-Za-z])", perl = TRUE)[[1]]

or

strsplit(x, "(?<=\\d)\\s(?=\\D)", perl = TRUE)[[1]] # less explicit, but much cooler

Both return:

#[1] "Carroll 103 215"   "Albany City 24 41" "Allegany 115 231"  "Charlotte 116 248"
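For reference, a self-contained sketch of the two calls above. Both patterns split on a whitespace character that is preceded by a digit (the lookbehind (?<=\d)) and followed by a letter or, in the second form, any non-digit (the lookahead). Lookarounds are zero-width, so only the whitespace itself is consumed by the split. The names by_letter and by_nondigit are illustrative.

```r
# Input string from the question
x <- "Carroll 103 215 Albany City 24 41 Allegany 115 231 Charlotte 116 248"

# Split where a digit is followed by whitespace and then a letter
by_letter <- strsplit(x, "(?<=\\d)\\s(?=[A-Za-z])", perl = TRUE)[[1]]

# Split where a digit is followed by whitespace and then any non-digit
by_nondigit <- strsplit(x, "(?<=\\d)\\s(?=\\D)", perl = TRUE)[[1]]

print(by_letter)

# For this input the two patterns give the same pieces
print(identical(by_letter, by_nondigit))
```

The two forms could differ on input where a number is followed by a non-letter, non-digit character such as punctuation, since \D matches that as well.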

data:

x = "Carroll 103 215 Albany City 24 41 Allegany 115 231 Charlotte 116 248"

Learn more:

https://regex101.com/r/7cUESK/1

Andre Elrico