Split a string into lines based on regex of one or more words followed by two numeric values

Question

Given a string such as this:

x <- c("Carroll 103 215 Albany City 24 41 Allegany 115 231 Charlotte 116 248")

What's the best way to split this into lines such as these?

# [1] Carroll 103 215
# [2] Albany City 24 41
# [3] Allegany 115 231
# [4] Charlotte 116 248

It's the "Albany City" that is giving me trouble. Other names will also consist of more than one word (e.g. "Port Jervis City"); however, each name should always be followed by numeric values of length 1 or more.

Maybe try something like `str_extract_all(x,"([A-Za-z]+ )+(\\d+ )+")` with the `stringr` package. — nicola, Nov 15 '18 at 14:26
@Nicola this works but returns trailing space and also misses the last numeric value for Charlotte. — JasonAizkalns, Nov 15 '18 at 14:32
@WiktorStribiżew no, always ASCII, and there will always be exactly two numbers, although a more flexible solution would be preferred. — JasonAizkalns, Nov 15 '18 at 14:44
Try `regmatches(x, gregexpr("\\b[A-Za-z][A-Za-z ]*\\d[ \\d]*\\b", x, perl=TRUE))` — Wiktor Stribiżew, Nov 15 '18 at 14:48
@WiktorStribiżew `regmatches(x, gregexpr("\\b[A-Za-z][A-Za-z ]*(?:\\s?\\d+)*\\b", x, perl=TRUE))` ? — Andre Elrico, Nov 15 '18 at 15:09

mrzasa · Answer 1 · 2018-11-15T14:50:32.640

2

You can use str_extract_all from the stringr package, which returns all matches of a regex in a string, with this pattern:

[A-Za-z ]+(\s\d+)+\s?

Explanation:

[A-Za-z ]+ matches one or more letters or spaces, i.e. any number of words separated by spaces
(\s\d+)+ matches one or more numbers, each preceded by a whitespace character
\s? matches an optional trailing whitespace character
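
To see what the pattern actually returns on the question's data, here is a minimal sketch. It swaps in base R's regmatches/gregexpr for str_extract_all so that it runs without extra packages; the variable names pattern, matches and records are illustrative only. Because the optional \s? consumes the separator, every match except the last carries a trailing space, which trimws removes.

```r
# Sketch: apply the answer's pattern with base R instead of stringr
x <- "Carroll 103 215 Albany City 24 41 Allegany 115 231 Charlotte 116 248"

# Backslashes are doubled because this is an R string literal
pattern <- "[A-Za-z ]+(\\s\\d+)+\\s?"

# gregexpr finds every match; regmatches extracts the matched substrings
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Strip the whitespace picked up by the optional \s? at the end
records <- trimws(matches)
print(records)
#[1] "Carroll 103 215"   "Albany City 24 41" "Allegany 115 231"  "Charlotte 116 248"
```

With stringr installed, str_extract_all(x, pattern)[[1]] should give the same untrimmed matches.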

edited Nov 15 '18 at 14:50

answered Nov 15 '18 at 14:26

This works but returns a leading space (which I can certainly handle after the fact) for all results except the first one – JasonAizkalns Nov 15 '18 at 14:30
Some grouping may help, `([A-Za-z ]+\s\d+\s\d+)\s?` details shown here: https://stackoverflow.com/questions/952275/regex-group-capture-in-r-with-multiple-capture-groups – mrzasa Nov 15 '18 at 14:34
This solution works only if each name is followed by exactly two numbers. – nicola Nov 15 '18 at 14:36
You can add `\w` to the start of the pattern to exclude the trailing space https://regex101.com/r/4MlO3q/2 – doom87er Nov 15 '18 at 14:46
@nicola: extended to support any number of numbers – mrzasa Nov 15 '18 at 14:50

Andre Elrico · Accepted Answer · 2018-11-15T14:57:47.080

2

You can use strsplit (see ?strsplit) from base R:

strsplit(x, "(?<=\\d)\\s(?=[A-Za-z])", perl = TRUE)[[1]]

or

strsplit(x, "(?<=\\d)\\s(?=\\D)", perl = TRUE)[[1]] # less explicit, but much cooler

Both return:

#[1] "Carroll 103 215"   "Albany City 24 41" "Allegany 115 231"  "Charlotte 116 248"
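
Why this works: the pattern only matches a whitespace character that has a digit immediately before it (the lookbehind (?<=\\d)) and a letter immediately after it (the lookahead (?=[A-Za-z])), so the spaces inside a multi-word name or between the numbers are never split points. A minimal sketch follows; the "Port Jervis City" record and its numbers 7 and 12 are made-up values for illustration, not data from the question.

```r
# Sketch: split only at a space that sits between a digit and a letter
# NOTE: the "Port Jervis City 7 12" record is hypothetical test data
x2 <- "Carroll 103 215 Albany City 24 41 Port Jervis City 7 12"

# perl = TRUE is required for lookbehind/lookahead support
parts <- strsplit(x2, "(?<=\\d)\\s(?=[A-Za-z])", perl = TRUE)[[1]]

# Print one record per line, as asked for in the question
writeLines(parts)
# Carroll 103 215
# Albany City 24 41
# Port Jervis City 7 12
```

Because the split depends only on the digit-to-letter boundary, it does not care how many words a name has or how many numbers follow it.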

data:

x = "Carroll 103 215 Albany City 24 41 Allegany 115 231 Charlotte 116 248"

Learn more:

https://regex101.com/r/7cUESK/1

edited Nov 15 '18 at 14:57

answered Nov 15 '18 at 14:50
