Assume the following data:

df <- data.frame(x = c("text, mail", "app.phone", "phone-text-mail", "e-mail", "e-mail, phone"))

I now want to split the text in the x column by several separators/delimiters. Here I want to use the most common ones: ",", ".", "-".

However, this would be problematic for the term "e-mail". So I was wondering if there's any way to create some sort of exclusion list where I could define terms that must not be split.

So this is how I would envision it:

delims <- c(",", ".", "-")
exclusions <- c("e-mail")

library(tidyverse)
df %>%
  mutate(split_x = str_split(x, delims)) # This would also split "e-mail"

So how could I define my patterns in the str_split function so that it would ignore all terms I define in exclusions?

In my real-life example I have many more potential separators and also many more potential exclusions, so I'm looking for a solution where I can pass my exclusions as a vector. I'm not sure whether this can be done via regex, or whether I need to first search each row of the x column for my exclusions and then skip splitting any row that contains one. However, that would be problematic for the last example row, because it contains a valid separator after the occurrence of "e-mail".

Expected outcome:

x                split_x_1    split_x_2    split_x_3
text, mail            text         mail           NA
app.phone              app        phone           NA
phone-text-mail      phone         text         mail
e-mail              e-mail           NA           NA
e-mail, phone       e-mail        phone           NA
deschen
  • If you have some hundreds of these exceptions or delimiters that you need to use in a dynamic way, try to avoid regex. – Wiktor Stribiżew Jan 25 '21 at 21:09
  • Maybe not hundreds, but potentially a list of 10-50 exclusions. – deschen Jan 25 '21 at 21:13
  • OK, then it may work out with a `(*SKIP)(*F)` pattern, unless all or most of them start with the same prefix (which will increase backtracking and may cause catastrophic backtracking). – Wiktor Stribiżew Jan 25 '21 at 21:14
  • Does the order of the split output for each vector matter? Seems like you could do a str_extract for all of your exclusions first, then str_remove them and run a simpler str_split code after. (Or just use the regex approach if it turns out to be small enough) – IceCreamToucan Jan 25 '21 at 21:19
  • @IceCreamToucan This could work. The order does not really matter in my case. I'll give it a shot tomorrow. – deschen Jan 25 '21 at 21:24
  • @WiktorStribiżew not sure why you closed this question. The one you are linking above is explicitly about regex, while I'm not strictly sticking to a regex solution, i.e. I'm open to alternative solutions like the one IceCreamToucan suggested above. – deschen Jan 26 '21 at 14:39
  • @deschen I am not sure why this was downvoted. The dupe-tagged one is a general one and it has nothing to do with the full solution the OP asked for – akrun Jan 26 '21 at 18:15
  • Potentially related question: [Ignore part of a string when splitting using regular expression in R](https://stackoverflow.com/questions/47287204/ignore-part-of-a-string-when-splitting-using-regular-expression-in-r) – Ian Campbell Jan 28 '21 at 04:38

1 Answer

We could skip the excluded term with the PCRE verbs `(*SKIP)(*F)` and split on the delimiters everywhere else:

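# "e-mail(*SKIP)(*F)" matches the excluded term and then discards that match,
# so only delimiters outside "e-mail" are used as split points
lst1 <- strsplit(df$x, "e-mail(*SKIP)(*F)|[,.-]", perl = TRUE)
# Pad each result with NA to the longest length and bind the rows into columns
df[paste0('split_x_', 1:3)] <- do.call(rbind, lapply(lst1,
         `length<-`, max(lengths(lst1))))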

-output

df
#                x split_x_1 split_x_2 split_x_3
#1      text, mail      text      mail      <NA>
#2       app.phone       app     phone      <NA>
#3 phone-text-mail     phone      text      mail
#4          e-mail    e-mail      <NA>      <NA>
#5   e-mail, phone    e-mail     phone      <NA>
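
Since the question asks for the exclusions to be passed as a vector, here is a minimal sketch of how the same pattern might be assembled from `delims` and `exclusions`. The helper `quote_literal()` is a hypothetical name, not part of any package; it wraps each term in `\Q...\E` so PCRE reads it literally, which assumes no term itself contains `\E`. The optional `\s*` around the delimiter is also an addition here, so that the pieces come back without surrounding spaces.

```r
# Same data and vectors as in the question
df <- data.frame(x = c("text, mail", "app.phone", "phone-text-mail",
                       "e-mail", "e-mail, phone"),
                 stringsAsFactors = FALSE)
delims <- c(",", ".", "-")
exclusions <- c("e-mail")

# Hypothetical helper: quote a term so PCRE treats it literally
# (assumes the term does not contain "\E")
quote_literal <- function(s) paste0("\\Q", s, "\\E")

# Try longer exclusions first, so a long term is not cut short by a
# shorter term that it starts with
excl <- exclusions[order(nchar(exclusions), decreasing = TRUE)]

pattern <- paste0(
  "(?:", paste(quote_literal(excl), collapse = "|"), ")(*SKIP)(*F)",
  "|\\s*(?:", paste(quote_literal(delims), collapse = "|"), ")\\s*"
)

parts <- strsplit(df$x, pattern, perl = TRUE)

# Pad to the longest result and bind into columns, as in the answer above
n <- max(lengths(parts))
df[paste0("split_x_", seq_len(n))] <- do.call(rbind, lapply(parts, `length<-`, n))
```

With a few dozen exclusions this keeps everything in one `strsplit()` call; whether it stays fast enough for a much longer list would need to be checked on the real data.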
akrun
  • You should mention that PCRE patterns have a length limit, so this will only work for a limited number of exclusions. See *"In my real-life example I have many more potential separators, but also many more potential exclusions, so I'm looking for a solution where I could pass my exclusions as a vector."* – Wiktor Stribiżew Jan 25 '21 at 21:10
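
If pattern length is a concern, the extract-then-split idea suggested in the comments on the question can be done without any regex. The following is only a sketch under some assumptions: `split_keep()` is a hypothetical helper name, the excluded terms are returned first rather than in their original position (the asker said order does not matter), and an exclusion is matched wherever it occurs, even inside a longer word.

```r
# Hypothetical helper: split one string on fixed delimiters while keeping
# the listed exclusions intact. No regular expressions are involved.
split_keep <- function(s, delims, exclusions) {
  # Longer exclusions first, so they are found before any shorter term
  # they contain
  excl <- exclusions[order(nchar(exclusions), decreasing = TRUE)]
  kept <- character(0)
  for (e in excl) {
    hits <- gregexpr(e, s, fixed = TRUE)[[1]]   # -1 when there is no hit
    kept <- c(kept, rep(e, sum(hits > 0)))
    # Put a delimiter where the term was, so its neighbours are not glued
    s <- gsub(e, delims[1], s, fixed = TRUE)
  }
  # Normalise every delimiter to the first one, then split once
  for (d in delims[-1]) s <- gsub(d, delims[1], s, fixed = TRUE)
  rest <- trimws(strsplit(s, delims[1], fixed = TRUE)[[1]])
  c(kept, rest[nzchar(rest)])                   # drop empty pieces
}

x <- c("text, mail", "app.phone", "phone-text-mail", "e-mail", "e-mail, phone")
delims <- c(",", ".", "-")
exclusions <- c("e-mail")

parts <- lapply(x, split_keep, delims = delims, exclusions = exclusions)
```

The result is a list with one character vector per row, which can be padded and bound into columns in the same way as `lst1` in the answer.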