Combining two datasets, with one kinda shared column?

Question

I have two datasets like so:

training.csv
last_name   ob1   ob2 
Adam        2:01  2:02
Barry, S    3:30  2:50
Barry, D          2:45
Charlie     4:00  
Don         2:00  1:50
Earl        2:50  2:30
Johnson, A  2:57  2:54
Johnson, T  3:15  3:10

and

racing.csv
last_name    first_name   1mile-time   500m-time
Barry        Sue          4:45         1:50
Don          Regan        4:35         0:50
Earl         Sage         4:50         1:30
Johnson      Adam         4:37         1:54
Johnson      Terry        4:50         2:10

So I used merge(training, racing, by = "last_name", all = TRUE) but some people have a shared last name. In the case that a last name was shared, it was entered as last name and first initial separated by a comma.

Another important thing to note, not everyone who goes to training makes the races. So there will be some unique names in training.csv that are not present in racing.csv.

Desired output

last_name   first_name   ob1   ob2   1mile-time   500m-time
Adam        Bob          2:01  2:02
Barry, S    Sue          3:30  2:50   4:45         1:50
Barry, D    Derrick            2:45
Charlie     Charles      4:00  
Don         Regan        2:00  1:50   4:35         0:50
Earl        Sage         2:50  2:30   4:50         1:30
Johnson, A  Adam         2:57  2:54   4:50         2:10
Johnson, T  Terry        3:15  3:10   4:50         2:10

See also this Q&A: [*How to join (merge) data frames (inner, outer, left, right)*](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right). — Jaap, Jul 13 '20 at 12:36
Does this answer your question? [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) — Martin Gal, Jul 13 '20 at 12:41
Hi everyone. Thank you for the feedback. No, that post does not answer my question. I updated the example data to show what I am referring to more clearly. There is no direct way to join the data on a single column, so I may potentially need an if..else... to join based on the `last_name` and `first_name` columns. — Hustlin, Jul 13 '20 at 12:57

score 1 · Accepted Answer · answered Jul 13 '20 at 13:37

You can try next solution. Using your provided data as inputs we have:

#Data
df1 <- structure(list(last_name = c("Adam", "Barry, S", "Barry, D", 
"Charlie", "Don", "Earl", "Johnson, A", "Johnson, T"), ob1 = c("2:01", 
"3:30", "", "4:00", "2:00", "2:50", "2:57", "3:15"), ob2 = c("2:02", 
"2:50", "2:45", "", "1:50", "2:30", "2:54", "3:10")), class = "data.frame", row.names = c(NA, 
-8L))
df2 <- structure(list(last_name = c("Barry", "Don", "Earl", "Johnson", 
"Johnson"), first_name = c("Sue", "Regan", "Sage", "Adam", "Terry"
), `1mile.time` = c("4:45", "4:35", "4:50", "4:37", "4:50"), 
    `500m.time` = c("1:50", "0:50", "1:30", "1:54", "2:10")), class = "data.frame", row.names = c(NA, 
-5L))

Now the code:

#Format vars
df1$last_name <- trimws(df1$last_name)
df2$last_name <- trimws(df2$last_name)
df2$first_name <- trimws(df2$first_name)
#Create index to check names
index2 <- which(df2$last_name %in% df1$last_name)
df2$Empty <- NA
df2$Empty[index2] <- df2$last_name[index2]
#Replace NA
df2$Empty <- ifelse(is.na(df2$Empty),paste0(df2$last_name,', ',substring(df2$first_name,1,1)),df2$Empty)
#Format data with new id
df3 <- df2
df3$last_name <- df3$Empty
df3$Empty <- NULL
#Now merge
Merged <- merge(df1,df3,by='last_name',all.x=T)
#Arrange
Merged <- Merged[,c(1,4,2,3,5,6)]

You will end up with this:

   last_name first_name  ob1  ob2 1mile.time 500m.time
1       Adam       <NA> 2:01 2:02       <NA>      <NA>
2   Barry, D       <NA>      2:45       <NA>      <NA>
3   Barry, S        Sue 3:30 2:50       4:45      1:50
4    Charlie       <NA> 4:00            <NA>      <NA>
5        Don      Regan 2:00 1:50       4:35      0:50
6       Earl       Sage 2:50 2:30       4:50      1:30
7 Johnson, A       Adam 2:57 2:54       4:37      1:54
8 Johnson, T      Terry 3:15 3:10       4:50      2:10

We used an index to check names and then create a new id to merge. Just remember that in your data not all last names as first names so there would be NA in some first names.

Thank you so much for your answer! Loved that it was base R too. — Hustlin, Jul 13 '20 at 14:40
I can make this a seperate question if need be, but I just noticed some people listed in `racing.csv` file were not in `training.csv` which excludes them in the merge. Is there a way to account for those people? — Hustlin, Jul 13 '20 at 17:41

score 1 · Answer 2 · answered Jul 13 '20 at 13:45

you can add a new columns to both dataframes with last name and intial of first name, but only for disputed rows. Here an attempt in base R:

library(stringr)
training$nameT <- training$last_name

training$nameT <- str_replace(training$nameT, ',', '')
racing$nameT   <- as.character(racing$last_name)
 #find index of non ambiguous last names:
idx <- which( !as.character(racing$last_name) %in% training$nameT   )

racing$nameT[idx] <- paste(racing$last_name[idx], substr(racing$first_name[idx],1,1))

merge(training, racing, by = "nameT", all.x = TRUE)

which will produce the output:

merge(training, racing, by = "nameT", all.x = TRUE)
      nameT last_name.x   ob1   ob2 last_name.y first_name X1mile.time X500m.time
1      Adam        Adam 02:01 02:02        <NA>       <NA>        <NA>       <NA>
2   Barry D    Barry, D 02:45              <NA>       <NA>        <NA>       <NA>
3   Barry S    Barry, S 03:30 02:50       Barry        Sue       04:45      01:50
4   Charlie     Charlie 04:00              <NA>       <NA>        <NA>       <NA>
5       Don         Don 02:00 01:50         Don      Regan       04:35      00:50
6      Earl        Earl 02:50 02:30        Earl       Sage       04:50      01:30
7 Johnson A  Johnson, A 02:57 02:54     Johnson       Adam       04:37      01:54
8 Johnson T  Johnson, T 03:15 03:10     Johnson      Terry       04:50      02:10

you can remove the unwanted columns

Thank you so much for the answer! Your solution also worked but I gave it to @Duck because he answered first and used base R. — Hustlin, Jul 13 '20 at 14:40

Combining two datasets, with one kinda shared column?

2 Answers2