
I have a column with unique email domains. (e.g. @comp1.com, @comp2.com, ...)

I have another dataset with an email column that includes many emails. Some of their domains will be present in the domain df, some will not.

I would like to create a new column "target_email", where it would return TRUE if the email is part of those targeted domains, and FALSE if not.

I have tried:

df$target_email<-grepl(domain$Email, df$Email)

df$target_email<-ifelse(grepl(domain$Email, df$Email), "TRUE", "FALSE")

df$target_email<-sapply(domain$Email, \(string) any(grepl(string, df$target_email, fixed = TRUE)))

These all return an error:

argument 'pattern' has length > 1 and only the first element will be used

or

replacement has 160 rows, data has 28446

Edit: Let's say we want to isolate emails that belong to a FAANG company

df$email<-c("matt@apple.com", "tash@amazon.com", "a@coke.com", "b@netflix.com", "c@pepsi.com")

domains$email<-c("apple.com", "netflix.com", "amazon.com", "google.com")

I want:
df$target_email<-c("True", "True", "False", "True", "False")
lala345
    Could you edit your question to include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – jpsmith Mar 20 '23 at 12:49
    Maybe: `rowSums(sapply(domain$Email, \(string) grepl(string, df$target_email, fixed = TRUE))) > 0` – GKi Mar 20 '23 at 12:51

6 Answers


To combine all patterns into a single regex string, concatenate them with the pipe symbol |. That way, str_detect() returns TRUE whenever one of the domains is matched.

library(tidyverse)

domains <- c("apple.com", "netflix.com", "amazon.com", "google.com")

df <- tibble(
  email = c("matt@apple.com", "tash@amazon.com", "a@coke.com", "b@netflix.com", "c@pepsi.com")
)

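# Anchor the whole alternation at the end of the string; collapse = "$|" alone
# would leave the last domain unanchored.
pattern <- str_c("@(", str_flatten(domains, collapse = "|"), ")$")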

df |> 
  mutate(target_email = str_detect(email, pattern))
#> # A tibble: 5 × 2
#>   email           target_email
#>   <chr>           <lgl>       
#> 1 matt@apple.com  TRUE        
#> 2 tash@amazon.com TRUE        
#> 3 a@coke.com      FALSE       
#> 4 b@netflix.com   TRUE        
#> 5 c@pepsi.com     FALSE

Created on 2023-03-20 with reprex v2.0.2
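
One caveat with alternation patterns built from raw domain names: an unescaped . matches any character, and an alternative without an end anchor can match anywhere in the address. Below is a minimal base R sketch of a stricter pattern. The addresses x@applexcom and apple.com@coke.com are made up for illustration and are not part of the original data.

```r
domains <- c("apple.com", "netflix.com", "amazon.com", "google.com")

# Hypothetical test addresses, chosen to trip up a naive pattern
emails <- c("matt@apple.com", "x@applexcom", "apple.com@coke.com", "b@netflix.com")

# Naive alternation: "." matches any character and nothing is anchored
naive <- grepl(paste(domains, collapse = "|"), emails)

# Escape the dots, then require "@<domain>" at the very end of the string
escaped <- gsub("\\.", "\\\\.", domains)
pattern <- paste0("@(", paste(escaped, collapse = "|"), ")$")
strict  <- grepl(pattern, emails)

naive   # TRUE TRUE TRUE TRUE
strict  # TRUE FALSE FALSE TRUE
```

For clean data like the example in the question both patterns give the same answer; the stricter one only matters when the email column may contain odd or malformed addresses.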

dufei

You can also use rowSums() to combine the per-domain matches.

df$target_email <- rowSums(sapply(domains, \(string)
     grepl(string, df$email, fixed = TRUE))) > 0

df
#            email target_email
#1  matt@apple.com         TRUE
#2 tash@amazon.com         TRUE
#3      a@coke.com        FALSE
#4   b@netflix.com         TRUE
#5     c@pepsi.com        FALSE

Alternatively, use endsWith() instead of grepl(), or Reduce() instead of rowSums().

rowSums(sapply(domains, \(string) endsWith(df$email, string))) > 0

Reduce(\(b,s) b | endsWith(df$email, s), domains, FALSE)
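
Note that a bare suffix test also accepts any domain that merely ends with a target domain. The sketch below shows one possible way around that, by prefixing each domain with "@". The address eve@pineapple.com is an invented example, not part of the question's data.

```r
domains <- c("apple.com", "netflix.com", "amazon.com", "google.com")

# Hypothetical addresses; "pineapple.com" ends with "apple.com"
emails <- c("matt@apple.com", "eve@pineapple.com", "a@coke.com")

# Bare suffixes: the second address is matched as well
loose <- Reduce(\(b, s) b | endsWith(emails, s), domains, FALSE)

# Prefixing "@" ties each suffix to the start of the domain part
strict <- Reduce(\(b, s) b | endsWith(emails, s), paste0("@", domains), FALSE)

loose   # TRUE TRUE FALSE
strict  # TRUE FALSE FALSE
```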

Data

domains <- c("apple.com", "netflix.com", "amazon.com", "google.com")

df <- data.frame(
  email = c("matt@apple.com", "tash@amazon.com", "a@coke.com", "b@netflix.com", "c@pepsi.com")
)

Benchmark

bench::mark(
"Reduce" = Reduce(\(b,s) b | endsWith(df$email, s), domains, FALSE),
"rowSums" = rowSums(sapply(domains, \(string) endsWith(df$email, string))) > 0,
"%in%" = gsub(".*@", "", df$email) %in% domains,
"str_detect/flatten" = stringr::str_detect(df$email, stringr::str_flatten(domains, collapse = "$|")),
"str_detect/str_c" = stringr::str_detect(df$email, stringr::str_c(domains, collapse = "|"))
)
#  expression             min median itr/s…¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴
#  <bch:expr>         <bch:t> <bch:>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t>
#1 Reduce                17µs 19.7µs  45661.   8.18KB   13.7   9997     3   219ms
#2 rowSums             44.2µs 48.2µs  20659.   6.23KB   12.4   9994     6   484ms
#3 %in%                14.1µs   15µs  65891.       0B    6.59  9999     1   152ms
#4 str_detect/flatten  50.7µs 53.5µs  18503.     264B    8.14  9089     4   491ms
#5 str_detect/str_c      71µs 74.5µs  13307.     528B    8.15  6533     4   491ms

Using %in% with gsub is the fastest in this case and allocates the least memory.

GKi

Here's a base R solution:

df$target_email <- gsub(".*@", "", df$email) %in% domains

# A tibble: 5 × 2
  email           target_email
  <chr>           <lgl>       
1 matt@apple.com  TRUE        
2 tash@amazon.com TRUE        
3 a@coke.com      FALSE       
4 b@netflix.com   TRUE        
5 c@pepsi.com     FALSE
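
Since %in% compares strings exactly, differences in case or stray whitespace would make a match fail. If the real data might contain such entries (an assumption; the example in the question does not), a sketch like the following normalises both sides first. The sample addresses are invented for illustration.

```r
domains <- c("apple.com", "netflix.com", "amazon.com", "google.com")

# Hypothetical messy input: mixed case and surrounding whitespace
emails <- c("Matt@Apple.COM", " tash@amazon.com ", "a@coke.com")

# Trim whitespace, drop everything up to the last "@", then lower-case
email_domain <- tolower(sub(".*@", "", trimws(emails)))
target <- email_domain %in% tolower(domains)

target  # TRUE TRUE FALSE
```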
Matt
library(dplyr)
library(stringr)

df %>% 
  mutate(
    domain = str_extract(email, "(?<=@)[^\\.]+\\.[^\\.]+$"),
    target_email = if_else(domain %in% domains$email, "True", "False")
  )

data:

df <- data.frame(email = c("matt@apple.com", "tash@amazon.com", "a@coke.com", "b@netflix.com", "c@pepsi.com"))

domains <- data.frame(email = c("apple.com", "netflix.com", "amazon.com", "google.com"))

TarJae

You can use str_detect (or grepl, of course) and form the domains into an alternation pattern:

library(tidyverse)
df %>%
  mutate(target = str_detect(email, str_c(domains, collapse = "|")))
            email target
1  matt@apple.com   TRUE
2 tash@amazon.com   TRUE
3      a@coke.com  FALSE
4   b@netflix.com   TRUE
5     c@pepsi.com  FALSE

Data:

df <- data.frame(email = c("matt@apple.com", "tash@amazon.com", "a@coke.com", "b@netflix.com", "c@pepsi.com"))
domains<-c("apple.com", "netflix.com", "amazon.com", "google.com")
Chris Ruehlemann
library(dplyr)

df <- data.frame(email = c("matt@apple.com", "tash@amazon.com", "a@coke.com", "b@netflix.com", "c@pepsi.com"))

domains <- c("apple.com", "netflix.com", "amazon.com", "google.com")

# just return TRUE or FALSE
df %>%
  mutate(target_email = gsub(".+@(.+)", "\\1", email) %in% domains)

# if you really want character strings for "True" and "False"
df %>%
  mutate(target_email = factor(gsub(".+@(.+)", "\\1", email) %in% domains, levels = c(TRUE, FALSE), labels = c("True", "False")))
Merijn van Tilborg