Assume the following data:
df <- data.frame(x = c("text, mail", "app.phone", "phone-text-mail", "e-mail", "e-mail, phone"))
I now want to split the text in the x column by several separators/delimiters. Here I want to use the most common ones: ",", ".", "-".
However, this would be problematic for the term "e-mail". So I was wondering if there's any way to create some sort of exclusion list where I could define terms that must not be split.
So this is how I would envision it:
delims <- c(",", ".", "-")
exclusions <- c("e-mail")
library(tidyverse)
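df %>%
  # delims is collapsed into one regex here; passing the vector directly would recycle it across rows
  mutate(split_x = str_split(x, str_c("\\Q", delims, "\\E", collapse = "|"))) # This would also split "e-mail"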
So how could I define my patterns in the str_split function so that it would ignore all terms I define in exclusions?
In my real-life example I have many more potential separators, but also many more potential exclusions, so I'm looking for a solution where I could pass my exclusions as a vector. I'm not sure whether this could be done via regex, or whether I need to first search for my exclusions in each row of the x column and then skip splitting any row that contains one. However, the latter would be problematic for the last example row, because that row does contain a valid separator after the occurrence of "e-mail".
Expected outcome:
x                split_x_1  split_x_2  split_x_3
text, mail       text       mail       NA
app.phone        app        phone      NA
phone-text-mail  phone      text       mail
e-mail           e-mail     NA         NA
e-mail, phone    e-mail     phone      NA
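
For illustration, here is a minimal sketch of one regex-based way to get the expected outcome above. It is not a stringr solution: it assumes base R's strsplit() with perl = TRUE, because the PCRE verbs (*SKIP)(*FAIL) used here are, as far as I know, not available in the ICU engine behind str_split. The helper quote_literal and the names pattern, parts, wide and result are made up for this example. It also assumes that every exclusion starts and ends with a word character, so that \b can stop "e-mail" from matching inside something like "phone-mail".

```r
df <- data.frame(x = c("text, mail", "app.phone", "phone-text-mail", "e-mail", "e-mail, phone"))
delims <- c(",", ".", "-")
exclusions <- c("e-mail")

# Hypothetical helper: wrap each string in \Q...\E so PCRE treats it literally
# (assumes the strings themselves do not contain "\E")
quote_literal <- function(s) paste0("\\Q", s, "\\E")

# First branch: match an excluded term, then skip past it without splitting.
# Second branch: match any delimiter plus surrounding whitespace.
pattern <- paste0(
  "\\b(?:", paste(quote_literal(exclusions), collapse = "|"), ")\\b(*SKIP)(*FAIL)",
  "|\\s*(?:", paste(quote_literal(delims), collapse = "|"), ")\\s*"
)

parts <- strsplit(df$x, pattern, perl = TRUE)

# Pad every row with NA to the same length and bind the pieces as columns
n <- max(lengths(parts))
wide <- do.call(rbind, lapply(parts, function(p) c(p, rep(NA_character_, n - length(p)))))
colnames(wide) <- paste0("split_x_", seq_len(n))
result <- cbind(df, wide)
print(result)
```

Both delims and exclusions stay plain character vectors, so longer lists can be passed in unchanged. If one exclusion is a prefix of another, sorting exclusions by decreasing length before building the pattern is probably safer.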