edit:
Sorry y'all, I didn't mean to repost a question. The issue I'm having isn't just with joining two tables, it's joining two tables with a column that isn't exactly the same in both tables (I updated the sample data to illustrate this). That is, I want to pmatch, or str_detect the strings within the Test.Takers$First column with the Every.Student.In.The.Country$First column. I'm not sure how to incorporate pmatch, or str_detect in left_join. If you could point me towards a SO article that covers this, then I would be grateful. My coding lingo is still pretty poor so none of the queries I put into SO lead me to anything useful.
Anyways, I ended up figuring out how to use lapply on my data.frames: it turns out all I had to do was convert each row of the data.frame into an individual list item. I had to tweak the 'matching_name_one_row' function to have only one input to make it work. It's actually WAY slower than the other two codes :'0(
matching_name_one_row <- function(student_df) {
require(dplyr)
require(stringr)
indexmp <- Every.Student.In.The.Country %>% filter(Paternal == as.character(student_df$Paternal), Maternal == as.character(student_df$Maternal))
id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
return(id_num[1])
}
rowlist <- list()
for(i in 1:nrow(Test.Takers)) {rowlist[[i]]<- Test.Takers[i,]}
Test.Takers$id_num <- unlist(lapply(rowlist, matching_name_one_row))
Original Question (with updated data):
Test.Takers <- data.frame(
Paternal = c('Last', 'Last','Last', 'Paternal', 'Paternal', "Father's Name"),
Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
First = c('First', 'Name', 'First', 'Name', 'First', 'BEE'),
id_num = NA,
stringsAsFactors = F)
Every.Student.In.The.Country <- data.frame(
Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
First = c('First', 'Name', 'First', 'Name', 'First', 'Something Else', 'BEEMYFRIEND'),
id_num = c(123, 456, 789, 234, 567, 890, 101),
stringsAsFactors = F)
I have two similar data.frames. The first data.frame contains ~30000 names without id_nums and with a lot of other variables that I am omitting. The second data.frame contains ~12000000 names with id_nums. I want to fill the first data.frame with id_nums by matching the names (Paternal, Maternal, and First) in the two data.frames.
I've come up with two solutions, but they are both very slow. The slowest, but easiest to read code is:
matching_name_one_row <- function(student_df, citizen_df) {
require(dplyr)
require(stringr)
indexmp <- citizen_df %>% filter(Paternal == as.character(student_df$Paternal), Maternal == as.character(student_df$Maternal))
id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
return(id_num[1])}
for(i in 1:nrow(Test.Takers)) {Test.Takers$id_num[i] = matching_name_one_row(Test.Takers[i,],Every.Student.In.The.Country)}
The function above (matching_name_one_row) only accepts one row of information from the Test.Takers data.frame. I've created it this way because I thought it would make it easier to use the function in a Map() or lapply() function. However, I still don't really understand either Map or lapply so I'm forced to use the code I wrote above. It is very slow...
The (slightly) faster, but more annoying to read code is below:
adding_id <- function(student_df, citizen_df){
require(dplyr)
require(stringr)
#Will hold subsets of last names
indexp <- data.frame(Paternal='name')
indexm <- data.frame(Maternal='name')
for(i in 1:nrow(student_df)) {
#Last names of current observation
namep <- student_df$Paternal[i]
namem <- student_df$Maternal[i]
#Prevents from iterating through the entire citizen_df unnecessarily
if(is.na(as.character(indexp$Paternal[1])) == T | as.character(indexp$Paternal[1]) != namep) {
indexp <- citizen_df %>% filter(Paternal == as.character(student_df$Paternal[i]))
}
#Error occurs when a name does not exist in the citizen file
if(is.na(indexp$Paternal[1]) == F) {
#Prevents from iterating through the entire citizen_df unnecessarily
if(is.na(as.character(indexm$Maternal[1])) == T | as.character(indexm$Maternal[1]) != namem) {
indexm <- indexp %>% filter(Maternal == as.character(student_df$Maternal[i]))
}
#Attach id_num if there is a partial string match for the first name
student_df$id_num[i] <- indexm$id_num[str_detect(indexm$First, as.character(student_df$First[i]))][1]
}
}
#creates a df for students with id_num found and not found
id_found <<- student_df %>% filter(is.na(id_num)==F)
id_not_found <<- student_df %>% filter(is.na(id_num)==T)
}
Both of these codes work, but take at least 11 hours to finish. I'm positive that there are faster ways of getting the same things done by using dplyr, lapply, and Map. For example, I know dplyr has two-table verbs that are probably meant for this kind of variable matching, I just don't know how to implement the two-table verbs. Please help me.