
I have a string

x <- "24.3483 stuff stuff 34.8325 some more stuff"

The pattern [0-9]{2}\\.[0-9]{4} marks the beginning of each substring I would like to extract. For the above example, I would like the output to be equivalent to

[1] "24.3483 stuff stuff"     "34.8325 some more stuff"

I've already looked at "R split on delimiter (split) keep the delimiter (split)" and tried:

> unlist(strsplit(x, "(?<=[[0-9]{2}\\.[0-9]{4}])", perl=TRUE))
[1] "24.3483 stuff stuff 34.8325 some more stuff"

which isn't what I want. I have also looked at "How should I split and retain elements using strsplit?".

Clarinetist

4 Answers

4

You may use

x <- "24.3483 stuff stuff 34.8325 some more stuff"
unlist(strsplit(x, "\\s+(?=[0-9]{2}\\.[0-9]{4})", perl=TRUE))
[1] "24.3483 stuff stuff"     "34.8325 some more stuff"

See the regex demo and the R demo.

Details

  • \s+ - one or more whitespace characters (requiring at least one prevents a match at the start of the string; you may replace it with \\s*\\b if a match may not be preceded by whitespace)
  • (?=[0-9]{2}\.[0-9]{4}) - a positive lookahead that requires, without consuming any text, 2 digits, a ., and 4 digits immediately to the right of the current location.
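
Because the split only happens where the full XX.XXXX pattern follows the whitespace, other digits inside the substrings should not trigger a split. A minimal sketch of that behaviour, using a made-up input string x2 (not from the question) that contains extra numbers:

```r
# Hypothetical input: the digits 7 and 12 occur inside the substrings
x2 <- "24.3483 stuff 7 stuff 34.8325 some 12 more stuff"

# Split on whitespace only where the XX.XXXX pattern follows it
parts <- unlist(strsplit(x2, "\\s+(?=[0-9]{2}\\.[0-9]{4})", perl = TRUE))
print(parts)
```

The whitespace is consumed by the split, so the pieces come back without trailing spaces.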
Wiktor Stribiżew
2

If you're sure there won't be digits in the intervening text ...

stringr::str_extract_all(x, "[0-9]{2}\\.[0-9]{4}[^0-9]+")

(every piece but the last includes a trailing space, which you could remove with trimws())

Alternatively you can use stringr::str_locate_all() to find starting positions. It's a little clunky but ...

pos <- stringr::str_locate_all(x, "[0-9]{2}\\.[0-9]{4}")[[1]][,"start"]
pos <- c(pos,nchar(x)+1)
Map(substr,pos[-length(pos)],pos[-1]-1,x=x)
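
The same locate-then-cut idea can also be sketched in base R, in case stringr is not available. This swaps in gregexpr() and substring() for str_locate_all() and Map(); the variable names starts, ends and pieces are just illustrative, and the sketch assumes the pattern matches at least once:

```r
x <- "24.3483 stuff stuff 34.8325 some more stuff"

# Start position of every XX.XXXX match (assumes at least one match)
starts <- as.integer(gregexpr("[0-9]{2}\\.[0-9]{4}", x)[[1]])

# Each piece ends one character before the next start;
# the last piece ends at the end of the string
ends <- c(starts[-1] - 1L, nchar(x))

# substring() is vectorised over start/end; trimws() drops the trailing space
pieces <- trimws(substring(x, starts, ends))
print(pieces)
```

Since only the start positions of the pattern are used, digits elsewhere in the text do not affect where the string is cut.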
Ben Bolker
  • There unfortunately are. To clarify, the XX.XXXX pattern is for sure what denotes the beginning of each substring. There may be numbers in the substrings themselves other than in the beginnings of each substring, but the XX.XXXX pattern does not appear elsewhere other than in the beginnings. – Clarinetist Aug 14 '19 at 12:47
0

You could match your pattern followed by any characters, as few as possible, and assert at the end that what is on the right is either the pattern again or the end of the string

\b[0-9]{2}\.[0-9]{4}.*?(?=\b[0-9]{2}\.[0-9]{4}|$)
  • \b Word boundary
  • [0-9]{2}\.[0-9]{4} Match 2 digits, dot and 4 digits
  • .*? Match any char 0+ times non greedy
  • (?=\b[0-9]{2}\.[0-9]{4}|$) Assert what is on the right is the initial pattern or the end of the string

Regex demo | R demo

x <- "24.3483 stuff stuff 34.8325 some more stuff"
stringr::str_extract_all(x, "\\b[0-9]{2}\\.[0-9]{4}.*?(?=\\b[0-9]{2}\\.[0-9]{4}|$)")
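
For comparison, a base R sketch of the same extraction, assuming the regex behaves the same under PCRE (perl = TRUE) as it does in stringr. It uses gregexpr() with regmatches() instead of str_extract_all(), and adds trimws() because the lazy match keeps the space before the next number:

```r
x <- "24.3483 stuff stuff 34.8325 some more stuff"

# Same pattern as above: the number, then a lazy match up to the next number or the end
pattern <- "\\b[0-9]{2}\\.[0-9]{4}.*?(?=\\b[0-9]{2}\\.[0-9]{4}|$)"

# gregexpr() finds all matches; regmatches() extracts them as a character vector
m <- gregexpr(pattern, x, perl = TRUE)
chunks <- trimws(regmatches(x, m)[[1]])
print(chunks)
```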
The fourth bird
0

If you don't mind putting your data into a dataframe/tibble you can use the following:

library(tidyverse)
x <- tibble(data = c("24.3483 stuff stuff 34.8325 some more stuff"))

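# A bracket expression like [^(\\d{2}\\.\\d{4})] negates single characters, not the whole
# pattern, so match lazily up to the next XX.XXXX (or the end of the string) instead
x %>% mutate(data_split = str_extract_all(data,
                                          pattern = "\\d{2}\\.\\d{4}.*?(?=\\s*\\d{2}\\.\\d{4}|$)"))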

You will end up with a list column whose entries are the split parts of your string.

jludewig