R Strsplit keep delimiter in second element

Question

I have been trying to solve this small issue for almost two hours without success. I simply want to split a string at a delimiter consisting of one space followed by a letter. The second element should keep the letter from the delimiter, whereas the first element should not contain it. Example:

 x <- "123123 123 A123"
 strsplit(x," [A-Z]")

results in:

"123123 123" "123"

However, this does not keep the letter A in the second element. I have tried using

strsplit(x,"(?<=[A-Z])",perl=T)

but this does not do what I need. It would also be fine if the second element contained the space; it just needs to include the letter.

Will all the separators you want to keep be surrounded by numbers, not letters (i.e. `123 456`)? Or is it just the first one you want to keep? — Phil, Jun 21 '17 at 11:15

score 4 · Accepted Answer · edited Sep 24 '17 at 05:00

If you want to follow your approach, you need to match one or more whitespace characters that are followed by a letter. Use a lookahead for the letter so that only the whitespace is consumed:

> strsplit(x,"\\s+(?=[A-Z])",perl=T)
[[1]]
[1] "123123 123" "A123"

See the PCRE regex demo.

Details:

\s+ - 1 or more whitespaces (put into the match value and thus will be removed during splitting)
(?=[A-Z]) - the uppercase ASCII letter must appear immediately to the right of the current location, else fail the match (the letter is not part of the match value, and will be kept in the result)
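To see why the lookahead matters, the following sketch contrasts the question's consuming pattern with the lookahead version on the same input. It uses only base R; the variable names `consumed` and `kept` are chosen here for illustration.

```r
x <- "123123 123 A123"

# Consuming pattern: the letter is part of the match, so it is removed
consumed <- strsplit(x, " [A-Z]")[[1]]
print(consumed)  # "123123 123" "123"

# Lookahead: only the whitespace is matched, the letter stays in the next piece
kept <- strsplit(x, "\\s+(?=[A-Z])", perl = TRUE)[[1]]
print(kept)      # "123123 123" "A123"
```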

You may also match up to the last non-whitespace char followed by 1+ whitespaces, and use the \K match reset operator to discard the text matched before the whitespace:

> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"

If the string contains line breaks, add a DOTALL flag since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".

Details:

^ - start of string
.* - any 0+ chars up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
\S - a non-whitespace
\K - here, drop all the text matched so far
\s+ - 1 or more whitespaces.
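The two patterns are not interchangeable when a string contains more than one whitespace run followed by a letter. A small sketch, using a made-up input `y` and illustrative variable names:

```r
y <- "12 A34 B56"  # hypothetical input with two candidate split points

# Lookahead: splits at every whitespace run that precedes an uppercase letter
by_lookahead <- strsplit(y, "\\s+(?=[A-Z])", perl = TRUE)[[1]]
print(by_lookahead)  # "12" "A34" "B56"

# \K variant: the greedy .* runs to the last whitespace run, so only one split
by_reset <- strsplit(y, "^.*\\S\\K\\s+", perl = TRUE)[[1]]
print(by_reset)      # "12 A34" "B56"
```

Which one fits depends on whether every letter-prefixed token should become its own element or only the last one.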

See another PCRE regex demo.
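The DOTALL note above can be checked with a short sketch. The input `z` is invented for illustration and contains a line break before the final space.

```r
z <- "123\n123 A123"  # hypothetical input containing a line break

# Without (?s): the dot stops at the line break, so the newline itself
# also becomes a split point
no_dotall <- strsplit(z, "^.*\\S\\K\\s+", perl = TRUE)[[1]]
print(no_dotall)  # "123" "123" "A123"

# With (?s): the dot also matches the line break, so only the last
# whitespace run is used as the split point
dotall <- strsplit(z, "(?s)^.*\\S\\K\\s+", perl = TRUE)[[1]]
print(dotall)     # "123\n123" "A123"
```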

thanks your code works. however, my issue is not yet solved. i have a vector of string and i want to do this splitting for each record. I used this code: sapply(df$var,function(x){strsplit(x,"\\s+(?<=[A-Z])",perl=T)[[1]][2]}) df$var is a column in a dataframe. If my delimiter is simply " ", then this code works. But for your delimiter, it does not work anymore. What do I have to change? — lorenzbr, Jun 21 '17 at 12:45
@Rnewbie Could you please update the question then? BTW, not `(?<=[A-Z])`, but `(?=[A-Z])`. Use a **lookahead**. — Wiktor Stribiżew, Jun 21 '17 at 12:47
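Since `strsplit` is vectorized over its first argument, the per-record split asked about in the comments may not need `sapply` wrapped around `strsplit` at all. A minimal sketch, assuming a data frame `df` with a column `var` (the data and the new column name `second` are invented for illustration):

```r
# Hypothetical data frame standing in for the one in the comment
df <- data.frame(var = c("123123 123 A123", "34512 321 B521"),
                 stringsAsFactors = FALSE)

# strsplit() accepts a whole character vector and returns a list of pieces;
# as.character() guards against the column being a factor
parts <- strsplit(as.character(df$var), "\\s+(?=[A-Z])", perl = TRUE)

# Take the second piece of each record (NA where there was no split)
df$second <- vapply(parts, function(p) p[2], character(1))
print(df$second)  # "A123" "B521"
```

Note the lookahead `(?=[A-Z])` rather than the lookbehind `(?<=[A-Z])` used in the comment.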

score 1 · Answer 2 · amonk · 2017-06-21T11:35:16.870

I would go with the stringi package:

library(stringi)
x <- c("123123 123 A123", "34512 321 B521")  # some modified input data

# split every string at each single space; l1 is a list with one vector per string
l1 <- stri_split(x, fixed = " ")
l1[[1]]
[1] "123123" "123"    "A123"

Then:

# re-join the first two pieces with a space and keep the third as is
lapply(l1, function(parts) c(paste(parts[1], parts[2]), parts[3]))

[[1]] 
[1] "123123 123" "A123"      

[[2]]
[1] "34512 321" "B521"
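As a possible alternative within stringi, the split and paste steps could be collapsed into one regex split that uses the lookahead from the accepted answer. This is a sketch assuming the stringi package is installed; stringi relies on the ICU regex engine, which supports lookaheads. The variable name `res` is chosen here for illustration.

```r
library(stringi)

x <- c("123123 123 A123", "34512 321 B521")

# Split on whitespace that is followed by an uppercase letter;
# the letter is not consumed and stays in the second piece
res <- stri_split_regex(x, "\\s+(?=[A-Z])")
print(res)
```

This avoids relying on each string having exactly three space-separated parts.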

edited Jun 21 '17 at 11:35

answered Jun 21 '17 at 11:23

