3

I would like to split character strings by multiple delimiters defined in a vector:

text1   <- "aweoiutw839572/)(&2aslk2468" 
text2   <- "147we547iu5erhg24tzu" 
dat <-  rbind(text1, text2)
vector <- c("we", "iu", "24")

The result should be:

var1 del1 var2 del2  var3                del3 var4
a    we   o    iu    tw839572/)(&2aslk   24   68
147  we   547  iu    5erhg               24   tzu

Any ideas using strsplit?

Ronak Shah
tobias sch
  • Possible duplicate of https://stackoverflow.com/q/4350440/4137985 and https://stackoverflow.com/q/7069076/4137985 – Cath Jul 19 '18 at 06:29

3 Answers

6

We can use strsplit here with lookarounds using the following pattern:

(?<=we|iu|24)|(?<=.)(?=we|iu|24)

The basic idea of the above regex is that a split should happen at any position that is immediately preceded or immediately followed by one of we, iu, or 24. Of particular note is the extra lookbehind (?<=.) placed directly in front of the lookahead in the second alternative. This is needed because of a quirk in the way strsplit in R handles lookaheads. See here for more information about that.

text1 <- "aweoiutw839572/)(&2aslk2468"
vector <- c("we", "iu", "24")
terms <- paste0(vector, collapse="|")
regex <- paste0("(?<=", terms, ")|(?<=.)(?=", terms, ")")

strsplit(text1, regex, perl=TRUE)

[[1]]
[1] "a"                 "we"                "o"                
[4] "iu"                "tw839572/)(&2aslk" "24"               
[7] "68"               
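
To see the effect of the `(?<=.)` guard, here is a minimal sketch that runs the same split with and without it. The names `with_guard` and `without_guard` are introduced here for illustration only and are not part of the answer above; the second pattern exists purely for comparison.

```r
text1  <- "aweoiutw839572/)(&2aslk2468"
vector <- c("we", "iu", "24")
terms  <- paste0(vector, collapse = "|")

# Pattern from the answer, including the extra (?<=.) lookbehind
with_guard    <- paste0("(?<=", terms, ")|(?<=.)(?=", terms, ")")
# Same pattern with the (?<=.) lookbehind removed (comparison only)
without_guard <- paste0("(?<=", terms, ")|(?=", terms, ")")

res_with    <- strsplit(text1, with_guard, perl = TRUE)[[1]]
res_without <- strsplit(text1, without_guard, perl = TRUE)[[1]]

print(res_with)     # delimiters kept as whole pieces
print(res_without)  # differs from res_with
```

Printing both results side by side makes it easy to check that only the guarded pattern keeps each delimiter as a single piece.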

Demo

Tim Biegeleisen
  • Nice! I was thinking of something similar, but I couldn't quite figure out the lookahead. Could you explain why it has to be prefixed with the "any" lookbehind `(?<=.)`? – Mikko Marttila Jul 19 '18 at 06:35
  • @MikkoMarttila I just updated my answer, with a link which discusses this. Try removing `(?<=.)` and you will see it fail. – Tim Biegeleisen Jul 19 '18 at 06:36
  • Yes thanks, read my mind :D I did give it a go, and was confused, esp. since testing the regex on e.g. regex101.com seemed like it should work. I'll have a look at the link! – Mikko Marttila Jul 19 '18 at 06:37
3

You can use gsub after pasting the vector together to obtain (we|iu|24), which is the pattern we need. We call paste(vector, collapse = "|") to get we|iu|24, then wrap it in ( and ). We then capture any of these delimiters as group 1 and replace it with the backreference \\1 surrounded by spaces. Lastly, we use the read.table function, which splits each line on whitespace.

pattern <- paste0("(", paste(vector, collapse = "|"), ")")
read.table(text = gsub(pattern, " \\1 ", dat))

   V1 V2  V3 V4                V5 V6  V7
1   a we   o iu tw839572/)(&2aslk 24  68
2 147 we 547 iu             5erhg 24 tzu
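
The steps described above can be traced one at a time. The following is a sketch only: the names `pattern`, `spaced`, and `res` are illustrative, and it assumes the input strings contain no whitespace, quote characters, or `#`, since read.table treats those specially by default.

```r
text1  <- "aweoiutw839572/)(&2aslk2468"
text2  <- "147we547iu5erhg24tzu"
dat    <- rbind(text1, text2)
vector <- c("we", "iu", "24")

# Step 1: build the capturing pattern "(we|iu|24)"
pattern <- paste0("(", paste(vector, collapse = "|"), ")")

# Step 2: surround every delimiter with spaces
spaced <- gsub(pattern, " \\1 ", dat[, 1])
print(unname(spaced))
# [1] "a we o iu tw839572/)(&2aslk 24 68" "147 we 547 iu 5erhg 24 tzu"

# Step 3: let read.table split each line on whitespace
res <- read.table(text = spaced)
print(dim(res))
```

If the data could contain spaces, a different separator would be needed, as in the strsplit-based variants elsewhere on this page.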
Onyambu
  • You can include the names: `read.table(text=gsub(paste0("(",paste(vector,collapse = "|"),")")," \\1 ",dat),col.names= paste0(c("Var","del"),rep(1:4,each=2,length=7)))` – Onyambu Jul 19 '18 at 18:26
1

Inspired by Onyambu with focus on tractability:

library(magrittr)
vecapsed <- sprintf("(%s)", paste(vector, collapse = "|"))
# "(we|iu|24)"

dats <- gsub(vecapsed, "|\\1|", dat[, 1]) %>% 
  strsplit(., "|", fixed = TRUE) %>%
  do.call(rbind, .)

# resulting in:
      [,1]  [,2] [,3]  [,4] [,5]                [,6] [,7] 
text1 "a"   "we" "o"   "iu" "tw839572/)(&2aslk" "24" "68" 
text2 "147" "we" "547" "iu" "5erhg"             "24" "tzu"

# The column names:
del <- apply(dats, 2, function(x) all(x %in% vector))
colnames(dats) <- make.unique(ifelse(del, "del", "var"))

      var   del  var.1 del.1 var.2               del.2 var.3
text1 "a"   "we" "o"   "iu"  "tw839572/)(&2aslk" "24"  "68" 
text2 "147" "we" "547" "iu"  "5erhg"             "24"  "tzu"
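
As a variation, the same idea can be written in base R without magrittr, with the columns numbered var1, del1, var2, ... as in the question. This is a minimal sketch, not the answer's own code: the names `pattern` and `parts` are illustrative, and it assumes the strings never contain a literal "|" and never start or end with a delimiter.

```r
text1  <- "aweoiutw839572/)(&2aslk2468"
text2  <- "147we547iu5erhg24tzu"
dat    <- rbind(text1, text2)
vector <- c("we", "iu", "24")

# Wrap each delimiter in "|" markers, then split on the literal "|"
pattern <- sprintf("(%s)", paste(vector, collapse = "|"))
parts   <- strsplit(gsub(pattern, "|\\1|", dat[, 1]), "|", fixed = TRUE)
dats    <- do.call(rbind, parts)

# A column is a delimiter column if all of its values are in `vector`
del <- apply(dats, 2, function(x) all(x %in% vector))

# Number the var and del columns separately: var1, del1, var2, ...
colnames(dats) <- ifelse(del,
                         paste0("del", cumsum(del)),
                         paste0("var", cumsum(!del)))
print(dats)
```

Counting with cumsum keeps the numbering of var and del columns independent, so the names line up with the layout requested in the question.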
s_baldur