1

edit:

Sorry y'all, I didn't mean to repost a question. The issue I'm having isn't just with joining two tables, it's joining two tables with a column that isn't exactly the same in both tables (I updated the sample data to illustrate this). That is, I want to pmatch, or str_detect the strings within the Test.Takers$First column with the Every.Student.In.The.Country$First column. I'm not sure how to incorporate pmatch, or str_detect in left_join. If you could point me towards a SO article that covers this, then I would be grateful. My coding lingo is still pretty poor so none of the queries I put into SO lead me to anything useful.

Anyways, I ended up figuring out how to use lapply on my data.frames: it turns out all I had to do was convert each row of the data.frame into an individual list item. I had to tweak the 'matching_name_one_row' function to have only one input to make it work. It's actually WAY slower than the other two codes :'0(

matching_name_one_row <- function(student_df) {
    require(dplyr)
    require(stringr)

    indexmp <- Every.Student.In.The.Country %>% filter(Paternal == as.character(student_df$Paternal), Maternal == as.character(student_df$Maternal))
    id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
    return(id_num[1])
}


rowlist <- list()
for(i in 1:nrow(Test.Takers)) {rowlist[[i]]<- Test.Takers[i,]}
Test.Takers$id_num <- unlist(lapply(rowlist, matching_name_one_row))

Original Question (with updated data):

Test.Takers <- data.frame(
    Paternal = c('Last', 'Last','Last', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
    First = c('First', 'Name', 'First', 'Name', 'First', 'BEE'),
    id_num = NA,
    stringsAsFactors = F)

Every.Student.In.The.Country <- data.frame(
    Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
    First = c('First', 'Name', 'First', 'Name', 'First', 'Something Else', 'BEEMYFRIEND'),
    id_num = c(123, 456, 789, 234, 567, 890, 101),
    stringsAsFactors = F)

I have two similar data.frames. The first data.frame contains ~30000 names without id_nums and with a lot of other variables that I am omitting. The second data.frame contains ~12000000 names with id_nums. I want to fill the first data.frame with id_nums by matching the names (Paternal, Maternal, and First) in the two data.frames.

I've come up with two solutions, but they are both very slow. The slowest, but easiest to read code is:

matching_name_one_row <- function(student_df, citizen_df) {
    require(dplyr)
    require(stringr)

    indexmp <- citizen_df %>% filter(Paternal == as.character(student_df$Paternal), Maternal == as.character(student_df$Maternal))
    id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
    return(id_num[1])}


for(i in 1:nrow(Test.Takers)) {Test.Takers$id_num[i] = matching_name_one_row(Test.Takers[i,],Every.Student.In.The.Country)}

The function above (matching_name_one_row) only accepts one row of information from the Test.Takers data.frame. I've created it this way because I thought it would make it easier to use the function in a Map() or lapply() function. However, I still don't really understand either Map or lapply so I'm forced to use the code I wrote above. It is very slow...

The (slightly) faster, but more annoying to read code is below:

adding_id <- function(student_df, citizen_df){

  require(dplyr)
  require(stringr)

  #Will hold subsets of last names
  indexp <- data.frame(Paternal='name')
  indexm <- data.frame(Maternal='name')

  for(i in 1:nrow(student_df)) {

    #Last names of current observation
    namep <- student_df$Paternal[i]
    namem <- student_df$Maternal[i]

    #Prevents from iterating through the entire citizen_df unnecessarily
    if(is.na(as.character(indexp$Paternal[1])) == T | as.character(indexp$Paternal[1]) != namep) {

      indexp <- citizen_df %>% filter(Paternal == as.character(student_df$Paternal[i]))

    }

    #Error occurs when a name does not exist in the citizen file
    if(is.na(indexp$Paternal[1]) == F) {

      #Prevents from iterating through the entire citizen_df unnecessarily
      if(is.na(as.character(indexm$Maternal[1])) == T | as.character(indexm$Maternal[1]) != namem) {

        indexm <- indexp %>% filter(Maternal == as.character(student_df$Maternal[i]))

      }

      #Attach id_num if there is a partial string match for the first name
      student_df$id_num[i] <- indexm$id_num[str_detect(indexm$First, as.character(student_df$First[i]))][1]

    }

  }

  #creates a df for students with id_num found and not found
  id_found <<- student_df %>% filter(is.na(id_num)==F)
  id_not_found <<- student_df %>% filter(is.na(id_num)==T)

}

Both of these codes work, but take at least 11 hours to finish. I'm positive that there are faster ways of getting the same things done by using dplyr, lapply, and Map. For example, I know dplyr has two-table verbs that are probably meant for this kind of variable matching, I just don't know how to implement the two-table verbs. Please help me.

Community
  • 1
  • 1
beemyfriend
  • 85
  • 1
  • 11

1 Answers1

1

You are on the right track. The dplyr was designed for problems like this. You will want to research the join functions, but for what you describe the left_join should be the correct version.

library(dplyr)
left_join(Test.Takers, Every.Student.In.The.Country, by=c("Paternal", "Maternal", "First"))

This will now add the id column from the Every.Student.In.The.Country data frame onto the Test.Takers data frame.

Dave2e
  • 22,192
  • 18
  • 42
  • 50
  • This is incredibly helpful and definitely helps a lot! Thank you! However, sometimes the 'First' names in both data frames don't match exactly. I would like to match the 'First' names with a function like pmatch or str_detect. Is there a way I can put that function into the left_join::by? – beemyfriend Jun 30 '16 at 03:48
  • I don't know if that is possible, pmatch has the potential of returning more than one result, that is a challenge. Since this question has been marked as a duplicate, it is unlikely to have additional comments. I suggest creating a new question using this as a starting point. – Dave2e Jun 30 '16 at 12:24
  • I made an edit. If I don't get any more responses, then I will write a new question. Thank you for the advice! (I'm hoping to resolve the problem regarding multiple outputs by only using the first output of pmatch/str_detect: ) – beemyfriend Jun 30 '16 at 18:47