I recently posted a related question: "dplyr, lapply, or Map to identify information from one data.frame and place it into another".
My main issue is using dplyr/lapply to combine two data.frames by a column of strings. The strings are first names, but they are not always written exactly the same way in both data.frames.
For example, I want 'Jon' in df1 to match 'Jonathan' in df2, and 'Carol' in df1 to match 'Caroline' in df2.
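To make the matching rule concrete, here is a minimal sketch of the comparison I have in mind. The vectors `short_names` and `full_names` are made up for illustration only; the point is that an exact comparison fails while a substring or prefix test succeeds.

```r
library(stringr)

# Hypothetical vectors: short forms as in df1, full names as in df2
short_names <- c("Jon", "Carol")
full_names  <- c("Jonathan", "Caroline")

# Exact comparison fails for both pairs
exact_match <- short_names == full_names

# str_detect is vectorised over string and pattern, so each full name
# is checked against the short name in the same position
substring_match <- str_detect(full_names, short_names)

# Base R alternative that only accepts matches at the start of the name
prefix_match <- startsWith(full_names, short_names)
```

Note that `str_detect` matches anywhere in the string, so 'Jon' would also match a name that merely contains "Jon", whereas `startsWith` is stricter.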
# The data.frame below stands in for one with ~30,000 rows
Test.Takers <- data.frame(
  Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', "Father's Name"),
  Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
  First = c('Carol', 'Name', 'First', 'Name', 'First', 'Jon'),
  id_num = NA,
  stringsAsFactors = FALSE)
# The data.frame below stands in for one with ~12,000,000 rows
Every.Student.In.The.Country <- data.frame(
  Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
  Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
  First = c('Caroline', 'Name', 'First', 'Name', 'First', 'Something Else', 'Jonathan'),
  id_num = c(123, 456, 789, 234, 567, 890, 101),
  stringsAsFactors = FALSE)
I've come up with an lapply-based function that incorporates str_detect, but it is incredibly slow:
library(dplyr)
library(stringr)

matching_name_one_row <- function(student_df) {
  # Filter the massive student file down to rows where both last names match
  indexmp <- Every.Student.In.The.Country %>%
    filter(Paternal == as.character(student_df$Paternal),
           Maternal == as.character(student_df$Maternal))
  # Use str_detect to identify potential first-name matches among the filtered rows
  id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
  # Return only the first match (NA if there is none)
  id_num[1]
}
# Create a list of individual rows to apply the function to
rowlist <- lapply(seq_len(nrow(Test.Takers)), function(i) Test.Takers[i, ])
# Apply the function to each row and collect the matched ids
Test.Takers$id_num <- unlist(lapply(rowlist, matching_name_one_row))
dplyr has two-table verbs like left_join that are meant for combining information from big data.frames. However, I don't know how to incorporate a function like str_detect or pmatch into a verb like left_join.
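One direction that might work is to split the problem in two: join exactly on the two last-name columns, then apply str_detect to the joined rows. The following is only a sketch under that assumption, not a tested solution at full scale. The small data.frames `takers` and `students` and the helper column `row_id` are hypothetical stand-ins, and the intermediate join could become very large if many students share the same pair of last names.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the two real data.frames
takers <- data.frame(
  Paternal = c("Last", "Last", "Paternal", "Paternal"),
  Maternal = c("Maternal", "Maternal", "Last", "Maternal"),
  First = c("Carol", "Name", "Jon", "Zed"),
  stringsAsFactors = FALSE)
students <- data.frame(
  Paternal = c("Last", "Last", "Paternal", "Paternal"),
  Maternal = c("Maternal", "Maternal", "Last", "Last"),
  First = c("Caroline", "Name", "First", "Jonathan"),
  id_num = c(123, 456, 567, 101),
  stringsAsFactors = FALSE)

matched <- takers %>%
  # Remember each test taker's original row so results can be mapped back
  mutate(row_id = row_number()) %>%
  # Exact join on both last names; the students' First becomes First.full
  inner_join(students, by = c("Paternal", "Maternal"), suffix = c("", ".full")) %>%
  # Keep candidates whose full first name contains the short one;
  # fixed() compares literally instead of treating the name as a regex
  filter(str_detect(First.full, fixed(First))) %>%
  # Keep only the first candidate per test taker
  distinct(row_id, .keep_all = TRUE)

# Map ids back by row; test takers without any candidate get NA
takers$id_num <- matched$id_num[match(seq_len(nrow(takers)), matched$row_id)]
```

With the toy data above, the first three test takers pick up a student id and the last one stays NA. Because several test takers and several students can share the same last names, recent dplyr versions (1.1.0 and later) warn about a many-to-many relationship on this join; that is expected here.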