use vector of patterns for str_split, but at the same time exclude certain patterns

Question

Assume the following data:

df <- data.frame(x = c("text, mail", "app.phone", "phone-text-mail", "e-mail", "e-mail, phone"))

I now want to split the text in the x column by several separators/delimiters. Here I want to use the most common ones: ",", ".", "-".

However, this would be problematic for the term "e-mail". So I was wondering if there's any way to create some sort of exclusion list where I could define terms that must not be split.

So this is how I would envision it:

delims <- c(",", ".", "-")
exclusions <- c("e-mail")

library(tidyverse)
df %>%
  mutate(split_x = str_split(x, delims)) # This would also split "e-mail"

So how could I define my patterns in the str_split function so that it would ignore all terms I define in exclusions?

In my real-life example I have many more potential separators, but also many more potential exclusions, so I'm looking for a solution where I could pass my exclusions as a vector. I'm not sure whether this can be done via regex, or whether I need to first check each row of the x column for any of my exclusions and then leave matching rows unsplit. However, that would be problematic for the last example row, because it contains a valid separator after the occurrence of "e-mail".

Expected outcome:

x                split_x_1    split_x_2    split_x_3
text, mail            text         mail           NA
app.phone              app        phone           NA
phone-text-mail      phone         text         mail
e-mail              e-mail           NA           NA
e-mail, phone       e-mail        phone           NA

If you have some hundreds of these exceptions or delimiters that you need to use in a dynamic way, try to avoid regex. — Wiktor Stribiżew, Jan 25 '21 at 21:09
Maybe not hundreds, but potentially a list of 10-50 exclusions. — deschen, Jan 25 '21 at 21:13
Ok, then it may work out with the `(*SKIP)(*F)` pattern, unless all or most of the exclusions start with the same prefix (which will increase backtracking and may cause catastrophic backtracking). — Wiktor Stribiżew, Jan 25 '21 at 21:14
Does the order of the split output for each vector matter? Seems like you could do a str_extract for all of your exclusions first, then str_remove them and run a simpler str_split code after. (Or just use the regex approach if it turns out to be small enough.) — IceCreamToucan, Jan 25 '21 at 21:19
@IceCreamToucan This could work. The order does not really matter in my case. I'll give it a shot tomorrow. — deschen, Jan 25 '21 at 21:24
@WiktorStribiżew not sure why you closed this question. The one you are linking above is explicitly about regex, while I'm not strictly sticking to a regex solution, i.e. I'm open to alternative solutions like the one IceCreamToucan suggested above. — deschen, Jan 26 '21 at 14:39
@deschen I am not sure why this was downvoted. The dupe tagged one is a general one and it has nothing to do with the full solution OP asked — akrun, Jan 26 '21 at 18:15
Potentially related question: [Ignore part of a string when splitting using regular expression in R](https://stackoverflow.com/questions/47287204/ignore-part-of-a-string-when-splitting-using-regular-expression-in-r) — Ian Campbell, Jan 28 '21 at 04:38
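
The extract-then-split idea from the comments can be sketched without any regex at all. The following is a minimal base R sketch, not code from the thread: `split_with_exclusions` is a hypothetical helper, and it assumes that the order of the resulting pieces does not matter (protected terms are returned first) and that all delimiters and exclusions should be matched literally.

```r
x          <- c("text, mail", "app.phone", "phone-text-mail", "e-mail", "e-mail, phone")
delims     <- c(",", ".", "-")
exclusions <- c("e-mail")

# Hypothetical helper: pull the protected terms out first, then split the rest.
# Note: protected terms come first in the result, so the original order is lost.
split_with_exclusions <- function(s, delims, exclusions) {
  kept <- character(0)
  # Handle longer exclusions first so a shorter one cannot break them up
  for (ex in exclusions[order(-nchar(exclusions))]) {
    hits <- gregexpr(ex, s, fixed = TRUE)[[1]]
    kept <- c(kept, rep(ex, sum(hits > 0)))
    # Replace with a delimiter (not "") so neighbouring text is not glued together
    s <- gsub(ex, delims[1], s, fixed = TRUE)
  }
  # Normalise every delimiter to the first one, then split on it literally
  for (d in delims[-1]) s <- gsub(d, delims[1], s, fixed = TRUE)
  rest <- trimws(strsplit(s, delims[1], fixed = TRUE)[[1]])
  c(kept, rest[nzchar(rest)])  # drop the empty pieces left by removed terms
}

res <- lapply(x, split_with_exclusions, delims = delims, exclusions = exclusions)
```

Because everything is matched with `fixed = TRUE`, no escaping is needed and the number of exclusions is not tied to any regex length limit; the cost is one pass over the string per exclusion and per delimiter.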

akrun · Accepted Answer · 2021-01-25T21:11:43.063


We could skip them with

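# "e-mail" is matched and then discarded by (*SKIP)(*F), so only the remaining
# delimiters split; \\s* also drops the space that follows a delimiter
lst1 <- strsplit(df$x, "e-mail(*SKIP)(*F)|[,.-]\\s*", perl = TRUE)
# Pad each vector of pieces with NA to the longest length, then bind as columns
df[paste0('split_x_', 1:3)] <- do.call(rbind, lapply(lst1,
         `length<-`, max(lengths(lst1))))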

-output

df
#                x split_x_1 split_x_2 split_x_3
#1      text, mail      text      mail      <NA>
#2       app.phone       app     phone      <NA>
#3 phone-text-mail     phone      text      mail
#4          e-mail    e-mail      <NA>      <NA>
#5   e-mail, phone    e-mail     phone      <NA>
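
Since the question asks for exclusions passed as a vector, the same `(*SKIP)(*F)` pattern can be assembled from `delims` and `exclusions` instead of being typed by hand. This is a minimal sketch under some assumptions: `escape_regex` is a hypothetical helper (not part of base R), every entry is meant to be matched literally, and optional whitespace around a delimiter should be dropped.

```r
# Sample data and vectors from the question
df <- data.frame(x = c("text, mail", "app.phone", "phone-text-mail",
                       "e-mail", "e-mail, phone"))
delims     <- c(",", ".", "-")
exclusions <- c("e-mail")

# Hypothetical helper: backslash-escape every non-alphanumeric character so
# that each entry is treated as literal text by PCRE
escape_regex <- function(x) gsub("([^[:alnum:]_])", "\\\\\\1", x)

# Longer exclusions first, so they win over shorter ones sharing a prefix
excl_alt  <- paste(escape_regex(exclusions[order(-nchar(exclusions))]), collapse = "|")
delim_alt <- paste(escape_regex(delims), collapse = "|")

# <exclusions>(*SKIP)(*F) | optional space, a delimiter, optional space
pattern <- paste0("(?:", excl_alt, ")(*SKIP)(*F)|\\s*(?:", delim_alt, ")\\s*")

parts <- strsplit(df$x, pattern, perl = TRUE)
n_max <- max(lengths(parts))
# Pad with NA to a common length and attach as split_x_1 ... split_x_n
df[paste0("split_x_", seq_len(n_max))] <-
  do.call(rbind, lapply(parts, `length<-`, n_max))
```

With 10-50 exclusions the pattern should stay small, but it is worth checking `nchar(pattern)` if the lists grow, given the PCRE length limit mentioned in the comments.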

edited Jan 25 '21 at 21:11

answered Jan 25 '21 at 21:09


You should mention that PCRE patterns have a length limit, so this will only work for a limited number of exceptions. See *"In my real-life example I have many more potential separators, but also many more potential exclusions, so I'm looking for a solution where I could pass my exclusions as a vector."* – Wiktor Stribiżew Jan 25 '21 at 21:10
