1

I have a dataset with emails like:

my_df <- data.frame(email = c("mirko@asdoi.com", "elsa@asodida.co.uk", "elsapina@asoqw.com"))

And I have an open source dataset like:

open_data <- data.frame(name = c("mirko", "elsa", "pina"), gender = c("male", "female", "male")
  1. How can I perform a lookup of my_df with open_data to associate the gender to each email?
  2. In the case of multiple join, I want it to create multiple records

The result should be:

result <- data.frame(email = c("mirko@asdoi.com", "elsa@asodida.co.uk", "elsapina@asoqw.com", "elsapina@asoqw.com"), gender = c("male", "female", "female", "male))
Seymour
  • 3,104
  • 2
  • 22
  • 46

2 Answers2

2

One option is to use the sqldf library and solve this via a database style join between the two data frames:

library(sqldf)
my_df$name <- sub("@.*$", "", my_df$email)
sql <- "select t1.email, t2.gender from my_df t1 inner join open_data t2 "
sql <- paste0(sql, "on t1.name like '%' || t2.name || '%'")
result <- sqldf(sql)
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • smart idea therefore I validate your answer. In the end, I used another solution (slightly adapted from the first answer to https://stackoverflow.com/questions/32914357/dplyr-inner-join-with-a-partial-string-match) that I will share in the next days. – Seymour Jul 02 '18 at 09:37
0

Perhaps something along these lines? Not sure how robust this will be for more complex cases though.

library(tidyverse)
open_data %>%
    rowwise() %>%
    mutate(email = list(grep(name, my_df$email))) %>%
    unnest() %>%
    mutate(email = my_df$email[email])
## A tibble: 4 x 3
#  name  gender email
#  <fct> <fct>  <fct>
#1 mirko male   mirko@asdoi.com
#2 elsa  female elsa@asodida.co.uk
#3 elsa  female elsapina@asoqw.com
#4 pina  male   elsapina@asoqw.com

Explanation: We use grep to find matches of open_data$name in my_df$email; then unnest to expand multiple matches, and use row indices to extract email entries.

Maurits Evers
  • 49,617
  • 4
  • 47
  • 68