Extracting substrings beginning with XX.XXXX

Question

I have a string

x <- "24.3483 stuff stuff 34.8325 some more stuff"

The pattern [0-9]{2}\\.[0-9]{4} marks the beginning of each substring I would like to extract. For the above example, I would like the output to be equivalent to

[1] "24.3483 stuff stuff"     "34.8325 some more stuff"

I've already looked at R split on delimiter (split) keep the delimiter (split):

> unlist(strsplit(x, "(?<=[[0-9]{2}\\.[0-9]{4}])", perl=TRUE))
[1] "24.3483 stuff stuff 34.8325 some more stuff"

which isn't what I want, as well as How should I split and retain elements using strsplit?.

it might help to add a more difficult example, e.g. one that has digits in the intervening text — Ben Bolker, Aug 14 '19 at 12:54

score 4 · Accepted Answer · answered Aug 14 '19 at 12:47

You may use

x <- "24.3483 stuff stuff 34.8325 some more stuff"
unlist(strsplit(x, "\\s+(?=[0-9]{2}\\.[0-9]{4})", perl=TRUE))
[1] "24.3483 stuff stuff"     "34.8325 some more stuff"

See the regex demo and the R demo.

Details

\s+ - 1+ whitespaces (this should prevent a match at the start of the string, you may replace it with \\s*\\b if the matches can have no whitespaces before)
(?=[0-9]{2}\.[0-9]{4}) - a positive lookahead that requires (does not consume the text!) 2 digits, ., and 4 digits immediately to the right of the current location.
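As a quick check of the point raised in the comments, here is a minimal sketch that runs the same split on an input with digits in the intervening text. The string x2 is a made-up example, not data from the question.

```r
# Hypothetical input (not from the question) with digits between the markers
x2 <- "24.3483 stuff 12 stuff 34.8325 some 7 more stuff"

# Split on whitespace that is immediately followed by the XX.XXXX pattern.
# The lookahead does not consume the number, so it stays with its substring.
parts <- unlist(strsplit(x2, "\\s+(?=[0-9]{2}\\.[0-9]{4})", perl = TRUE))
print(parts)
```

The stray "12" and "7" do not trigger a split, because the lookahead only succeeds where two digits are followed by a dot and four more digits.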

I was just preparing to post this (without the `+`, which is a nice touch). — Roland, Aug 14 '19 at 12:51

Ben Bolker · Answer 2 · 2019-08-14T13:02:00.043

2

If you're sure there won't be digits in the intervening text ...

stringr::str_extract_all(x, "[0-9]{2}\\.[0-9]{4}[^0-9]+")

(this includes a trailing space, which you could remove with trimws())

Alternatively you can use stringr::str_locate_all() to find starting positions. It's a little clunky but ...

pos <- stringr::str_locate_all(x, "[0-9]{2}\\.[0-9]{4}")[[1]][,"start"]
pos <- c(pos,nchar(x)+1)
Map(substr,pos[-length(pos)],pos[-1]-1,x=x)
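The same locate-then-cut idea can be sketched in base R, for readers who would rather not load stringr. This is an illustration only: x2 is a made-up input, and the sketch assumes the pattern matches at least once (gregexpr() returns -1 otherwise).

```r
# Hypothetical input (not from the question) with digits between the markers
x2 <- "24.3483 stuff 12 stuff 34.8325 some 7 more stuff"

# Start position of every XX.XXXX marker; as.integer() drops the attributes
starts <- as.integer(gregexpr("[0-9]{2}\\.[0-9]{4}", x2)[[1]])

# Each piece ends just before the next marker; the last one ends with the string
ends <- c(starts[-1] - 1L, nchar(x2))

# substring() is vectorised over start and stop; trimws() drops the trailing space
pieces <- trimws(substring(x2, starts, ends))
print(pieces)
```

Because the cut points come only from the marker positions, digits elsewhere in the text do not matter.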

edited Aug 14 '19 at 13:02

answered Aug 14 '19 at 12:46

Ben Bolker

There unfortunately are. To clarify, the XX.XXXX pattern is for sure what denotes the beginning of each substring. There may be numbers in the substrings themselves other than in the beginnings of each substring, but the XX.XXXX pattern does not appear elsewhere other than in the beginnings. – Clarinetist Aug 14 '19 at 12:47

The fourth bird · Answer 3 · 2019-08-14T12:58:15.530

0

You could use your pattern followed by a non-greedy match of any characters (.*?), and assert with a lookahead that what is on the right is either the next occurrence of the pattern or the end of the string

\b[0-9]{2}\.[0-9]{4}.*?(?=\b[0-9]{2}\.[0-9]{4}|$)

\b Word boundary
[0-9]{2}\.[0-9]{4} Match 2 digits, dot and 4 digits
.*? Match any char 0+ times non greedy
(?=\b[0-9]{2}\.[0-9]{4}|$) Assert what is on the right is the initial pattern or the end of the string

Regex demo | R demo

x <- "24.3483 stuff stuff 34.8325 some more stuff"
stringr::str_extract_all(x, "\\b[0-9]{2}\\.[0-9]{4}.*?(?=\\b[0-9]{2}\\.[0-9]{4}|$)")
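For comparison, a base R sketch of the same extraction using gregexpr() and regmatches() with perl = TRUE. The input x2 is a made-up example with digits in the intervening text. Since the lazy match runs right up to the next marker, the first piece keeps its trailing space, so trimws() is applied afterwards.

```r
# Hypothetical input (not from the question) with digits between the markers
x2 <- "24.3483 stuff 12 stuff 34.8325 some 7 more stuff"

# Same pattern as above: marker, lazy match, then a lookahead for the next
# marker or the end of the string
pattern <- "\\b[0-9]{2}\\.[0-9]{4}.*?(?=\\b[0-9]{2}\\.[0-9]{4}|$)"

# gregexpr() finds all matches; regmatches() extracts the matched text
matches <- regmatches(x2, gregexpr(pattern, x2, perl = TRUE))[[1]]

# Remove the whitespace left in front of the next marker
matches_trimmed <- trimws(matches)
print(matches_trimmed)
```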

edited Aug 14 '19 at 12:58

answered Aug 14 '19 at 12:43

The fourth bird

I think this has the same limitation that my first solution does ... ? what if there are digits in the intervening text ... ? – Ben Bolker Aug 14 '19 at 12:54
@BenBolker I see, then using a non greedy quantifier could also do it. I have updated the answer. – The fourth bird Aug 14 '19 at 12:58

score 0 · Answer 4 · answered Aug 14 '19 at 12:47

If you don't mind putting your data into a dataframe/tibble you can use the following:

library(tidyverse)
x <- tibble(data = c("24.3483 stuff stuff 34.8325 some more stuff"))

x %>% mutate(data_split = str_extract_all(data,
                                          pattern = "\\d{2}\\.\\d{4}[^(\\d{2}\\.\\d{4})]+"))

You will end up with a list column whose entries are the split parts of your string.
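One caveat worth illustrating: [^(\\d{2}\\.\\d{4})] is a negated character class, so it excludes individual characters (digits, dots, parentheses, braces) rather than the XX.XXXX sequence as a whole. The sketch below shows the effect on a made-up input, x2, with digits in the intervening text. It uses base R with perl = TRUE as a stand-in for stringr, which is an assumption on my part, since stringr uses the ICU engine. The second pattern is one possible lookahead-based alternative, not the answerer's code.

```r
# Hypothetical input (not from the question) with digits between the markers
x2 <- "24.3483 stuff 12 stuff 34.8325 some 7 more stuff"

# Pattern from the answer: the bracket expression is a character class,
# so matching stops at the first digit, dot, parenthesis or brace
class_pattern <- "\\d{2}\\.\\d{4}[^(\\d{2}\\.\\d{4})]+"
truncated <- regmatches(x2, gregexpr(class_pattern, x2, perl = TRUE))[[1]]
print(truncated)

# Assumed alternative: lazy match up to the next marker (or end of string);
# \\s* inside the lookahead keeps the separating whitespace out of the match
lookahead_pattern <- "\\d{2}\\.\\d{4}.*?(?=\\s*\\d{2}\\.\\d{4}|$)"
full <- regmatches(x2, gregexpr(lookahead_pattern, x2, perl = TRUE))[[1]]
print(full)
```

On the question's original string both patterns return the two expected pieces (the first with trailing whitespace), so the difference only shows up once the text between markers contains digits or dots.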

answered Aug 14 '19 at 12:47

jludewig

and of course you don't need to put your data into a dataframe, as seen in Ben's post.. – jludewig Aug 14 '19 at 12:48
