1

I have the following data frame (dat), in which each row is uniquely identified by a person's name.

structure(list(Name = c("John Smith", "Michael Jones", "Eric Stevens", 
"Brian McGee", "Dave Baker"), State = c("NJ", "MA", "LA", "WY", 
"AZ"), City = c("Trenton", "Springfield", "New Orleans", "Cheyenne", 
"Yuma"), DistanceDriven = c("123 km", "15 km", "777 miles", "1029 km", 
"8 miles"), DistanceFromHome = c("115 km", "8 km", "725 miles", 
"1029 km", "8 miles")), class = "data.frame", row.names = c(NA, 
-5L))
> 

Which looks more or less like this:

Name            State      City         Distance Driven     Distance From Home
John Smith        NJ      Trenton       123 km                   115 km
Michael Jones     MA      Springfield    15  km                    8  km
Eric Stevens      LA      New Orleans    777 miles                725 miles
Brian McGee       WY      Cheyenne       1029 km                  1029 km
Dave Baker        AZ       Yuma          8 miles                   8 miles
...

I have a second data frame (dat2), also uniquely identified by Name, that includes only a portion of the names in the initial data set, as well as some new names. However, the data for distance driven and distance from home does not exist (other than the column names).

structure(list(Name = c("John Smith", "Derek Thompson", "Eric Stevens", 
"Dave Baker"), State = c("NJ", "CA", "LA", "AZ"), City = c("Trenton", 
"Los Angeles", "New Orleans", "Yuma"), DistanceDriven = c(NA, 
NA, NA, NA), DistanceFromHome = c(NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-4L))

Which looks something like this:

Name               State      City        Distance Driven         Distance From
John Smith           NJ      Trenton
Derek Thompson       CA     Los Angeles
Eric Stevens         LA     New Orleans  
Dave Baker           AZ       Yuma            

I'm looking to create a new data frame which includes:

  • observations that were in both the first data frame (dat) and in the second data frame
  • observations that were only in the second data set
  • All the data in the rows that would be kept from from the initial data set, as well as all the data in the rows the second set

As such, I really just want to eliminate rows that were only present first data set and not the other. I would thus like the two data frames above to produce:

structure(list(Name = c("John Smith", "Derek Thompson", "Eric Stevens", 
"Dave Baker"), State = c("NJ", "CA", "LA", "AZ"), City = c("Trenton", 
"Los Angeles", "New Orleans", "Yuma"), DistanceDriven = c("123 km", 
"", "777 miles", "8 miles"), DistanceFromHome = c("115 km", "", 
"725 miles", "8 miles")), class = "data.frame", row.names = c(NA, 
-4L))
Name              State   City       Distance Driven         Distance From Home
John Smith         NJ     Trenton            120 km               115 km 
Derek Thompson     CA     Los Angeles
Eric Stevens       LA     New Orleans     777 miles                725 miles
Dave Baker         AZ      Yuma            8 miles                  8 miles

I hope that makes sense. Thanks in advance.

887
  • 599
  • 3
  • 15
  • What you're talking about in data-management is called "join" or "merge". Here are two good links for learning it: https://stackoverflow.com/q/1299871/3358272 and https://stackoverflow.com/a/6188334/3358272. Base R supports it with `merge`, tidyverse with `dplyr::*_join`, and data.table in various ways. For times when merges get a little less discrete (merge on number-ranges or on string-patterns), `fuzzyjoin` is another good package. – r2evans Oct 20 '20 at 15:56
  • 1
    ```df1 %>% filter(Name %in% df2$Name) %>% bind_rows(df2) %>% distinct(.keep_all = TRUE)``` If we don't wanna get into joining the two dataframes, this hacky solution works. – M-- Oct 20 '20 at 16:15

1 Answers1

1

Data:

    df1 <- structure(list(Name = c("John Smith", "Michael Jones", "Eric Stevens", 
    "Brian McGee", "Dave Baker"), State = c("NJ", "MA", "LA", "WY", 
    "AZ"), City = c("Trenton", "Springfield", "New Orleans", "Cheyenne", 
    "Yuma"), DistanceDriven = c("123 km", "15 km", "777 miles", "1029 km", 
    "8 miles"), DistanceFromHome = c("115 km", "8 km", "725 miles", 
    "1029 km", "8 miles")), class = "data.frame", row.names = c(NA, 
    -5L))
    
    df2 <- structure(list(Name = c("John Smith", "Derek Thompson", "Eric Stevens", 
    "Dave Baker"), State = c("NJ", "CA", "LA", "AZ"), City = c("Trenton", 
    "Los Angeles", "New Orleans", "Yuma"), DistanceDriven = c(NA, 
    NA, NA, NA), DistanceFromHome = c(NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
    -4L))

Code:

    library(dplyr)
    
    df1 %>% 
      right_join(df2, "Name")

Output:

#>             Name State.x      City.x DistanceDriven.x DistanceFromHome.x
#> 1     John Smith      NJ     Trenton           123 km             115 km
#> 2   Eric Stevens      LA New Orleans        777 miles          725 miles
#> 3     Dave Baker      AZ        Yuma          8 miles            8 miles
#> 4 Derek Thompson    <NA>        <NA>             <NA>               <NA>

#>   State.y      City.y DistanceDriven.y DistanceFromHome.y
#> 1      NJ     Trenton               NA                 NA
#> 2      LA New Orleans               NA                 NA
#> 3      AZ        Yuma               NA                 NA
#> 4      CA Los Angeles               NA                 NA
Eric
  • 2,699
  • 5
  • 17
  • What has happened to `State.y`, `City.y`, etc.? If you don't want them, then you need to use `select` and choose the columns that you want, not simply cutting them out of the results you're showing here. – M-- Oct 21 '20 at 14:47
  • @M-- ‘right_join’ default suffix is .x and .y for non-joined duplicate variables in x and y tables. – Eric Oct 21 '20 at 15:07
  • Correct. So why you're not showing the `State.y`, etc. in the output you have here? That's exactly what I am referring to. You're cutting out part of the output which is misleading. Your code does not give the output you're showing here. To be precise, it will give this, but not just this. – M-- Oct 21 '20 at 15:08
  • Apologies. I must have had an issue with `reprex` formatting. I have updated it. Thanks. – Eric Oct 21 '20 at 15:15