
I recently posted a related question: "dplyr, lapply, or Map to identify information from one data.frame and place it into another".

My main issue is combining two data.frames with dplyr or lapply on a column of strings. The strings are first names, but they are not always written identically in the two data.frames.

For example, I want 'Jon' in df1 to match 'Jonathan' in df2, or 'Carol' in df1 to match 'Caroline' in df2.

# The data.frame below stands in for a real one with ~30,000 rows
Test.Takers <- data.frame(
    Paternal = c('Last', 'Last','Last', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
    First = c('Carol', 'Name', 'First', 'Name', 'First', 'Jon'),
    id_num = NA,
    stringsAsFactors = FALSE)

# The data.frame below stands in for a real one with ~12,000,000 rows
Every.Student.In.The.Country <- data.frame(
    Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
    First = c('Caroline', 'Name', 'First', 'Name', 'First', 'Something Else', 'Jonathan'),
    id_num = c(123, 456, 789, 234, 567, 890, 101),
    stringsAsFactors = FALSE)

I've come up with an lapply-based function that uses str_detect, but it is incredibly slow:

library(dplyr)
library(stringr)

# Look up the id_num for a single one-row data.frame of test-taker information
matching_name_one_row <- function(student_df) {
    # Narrow the full student table to rows where both last names match exactly
    indexmp <- Every.Student.In.The.Country %>%
        filter(Paternal == as.character(student_df$Paternal),
               Maternal == as.character(student_df$Maternal))

    # Keep candidates whose first name contains the test taker's first name;
    # fixed() treats the name as a literal string rather than a regex
    id_num <- indexmp$id_num[str_detect(indexmp$First, fixed(as.character(student_df$First)))]

    # Return only the first match (NA if there is none)
    return(id_num[1])
}

# Create a list of one-row data.frames to apply the function to
rowlist <- lapply(seq_len(nrow(Test.Takers)), function(i) Test.Takers[i, ])

#Use lapply on list of individual rows
Test.Takers$id_num <- unlist(lapply(rowlist, matching_name_one_row))

dplyr has two-table verbs such as left_join that are designed for combining information across large data.frames. However, I don't know how to use a partial-matching function like str_detect or pmatch inside a join like left_join. Is there a way to do that?
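
For illustration, here is a minimal sketch of one way a join and str_detect could be combined, not a tested solution at the real data sizes: join on the two last-name columns first (an exact match), then filter the joined rows by first name. The `row_id` helper column, the `.full` suffix, and the `takers`, `candidates`, and `result` names are assumptions made for this sketch. Recent dplyr versions may warn about a many-to-many relationship in the join, which is expected here because several people share the same pair of last names.

```r
library(dplyr)
library(stringr)

# Toy data copied from the question (id_num left out of Test.Takers)
Test.Takers <- data.frame(
    Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
    First = c('Carol', 'Name', 'First', 'Name', 'First', 'Jon'),
    stringsAsFactors = FALSE)

Every.Student.In.The.Country <- data.frame(
    Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
    First = c('Caroline', 'Name', 'First', 'Name', 'First', 'Something Else', 'Jonathan'),
    id_num = c(123, 456, 789, 234, 567, 890, 101),
    stringsAsFactors = FALSE)

# row_id (hypothetical helper) remembers which test taker each joined row came from
takers <- Test.Takers %>%
    mutate(row_id = row_number())

candidates <- takers %>%
    # Exact match on both last names; the full table's First becomes First.full
    inner_join(Every.Student.In.The.Country,
               by = c("Paternal", "Maternal"),
               suffix = c("", ".full")) %>%
    # Keep rows where the full first name contains the test taker's first name
    filter(str_detect(First.full, fixed(First))) %>%
    # Keep only the first candidate per test taker, like id_num[1] in the lapply version
    distinct(row_id, .keep_all = TRUE) %>%
    select(row_id, id_num)

# Attach the matched id_num back; test takers without a match get NA
result <- takers %>%
    left_join(candidates, by = "row_id") %>%
    select(-row_id)

print(result)
```

Joining on the last names first means str_detect only runs on rows that already share both last names, instead of filtering the full table once per test taker. Whether the intermediate joined table fits in memory at ~12,000,000 rows depends on how many students share each pair of last names.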

beemyfriend