0

I have the following data frame (dat), in which each row is uniquely identified by a person's name.

Name               Distance Driven     Distance From Home
John Smith              123 km                   115 km
Michael Jones           15  km                    8  km
Eric Stevens            777 miles                725 miles
Brian McGee             1029 km                  1029 km
Dave Baker              8 miles                   8 miles

I have a second data frame (dat2), also uniquely identified by Name, that includes only a portion of the names in the initial data set, as well as some new names. However, the data in each row (other than name and the column names) does not exist.

Name               Distance Driven         Distance From Home
John Smith              
Derek Thompson         
Eric Stevens           
Dave Baker              

I'm looking to create a new data frame which includes:

  1. observations that were in both the original data frame (dat) and in the second data frame
  2. observations that were only in the second data set
  3. All the data in the rows from the initial data set

As such, I really just want to eliminate names that were only present first data set and not the other. I would thus like the two data frames above to produce:

Name               Distance Driven         Distance From Home
John Smith             120 km                    115 km 
Derek Thompson        
Eric Stevens           777 miles                725 miles
Dave Baker              

I hope that makes sense. Thanks in advance.

887
  • 599
  • 3
  • 15
  • Get rid of the empty columns from `dat2`, and then you use joins. – Gregor Thomas Oct 19 '20 at 20:15
  • I've marked duplicates for the FAQ about how to do joins in R, and a more general FAQ about what different types of joins are (the concept is common across languages). – Gregor Thomas Oct 19 '20 at 20:17
  • I am going to re-ask this question as I did not include all the data necessary to get a useful answer for my data. I have tried many variations of a left join but it does not solve all of my issues. – 887 Oct 20 '20 at 14:12
  • I look forward to the new version. It will be easier to help if you share sample data with `dput()` so it is copy/pasteable. The data shared in this question has spaces in column names and values, which makes it hard to import, and the blanks in `dat2` are ambiguous (are they missing values `NA` or empty strings `""` or something else?). Using `dput()` will allow us to directly import the data you share, including all class and structure information, and demonstrate a solution on it. – Gregor Thomas Oct 20 '20 at 14:17
  • 1
    Use `dput(dat1)` for a copy/pasteable version of `dat1`, or perhaps on a subset, `dput(dat1[1:10, ])` for the first 10 rows. – Gregor Thomas Oct 20 '20 at 14:18
  • Thank you, I will reformat and re-post the question in a moment. – 887 Oct 20 '20 at 14:26

1 Answers1

0

You want to do a left_join() from the dplyr package.

library(dplyr)
dat2 %>% 
  select(Name) %>%
  left_join(dat1)

Your output will look something like this:

Name               `Distance Driven`         `Distance From Home`
John Smith         120 km                    115 km 
Derek Thompson     NA                        NA   
Eric Stevens       777 miles                 725 miles
Dave Baker         NA                        NA
Ben Norris
  • 5,639
  • 2
  • 6
  • 15
  • I just tried this, and it works for the most part, except that it eliminates other useful information in dat2. As I was typing the question, I left out the fact that there is another column after 'Name' called 'State'. When I run the code above, it eliminates all the data for the 'state' column when the person was in the second data set but not the first. So the state column in the new data set just says "NA" for those observations. Is there a way to ameliorate this? – 887 Oct 20 '20 at 01:26
  • @rogues77 - Just list the columns from `dat2` you want to keep in `select()`. – Ben Norris Oct 20 '20 at 10:21
  • I actually already tried that--when I do it eliminates the data in 'Distance Driven' and 'Distance From Home' info for names in both the first and second data set. So, the resulting data set does not include any values for the 'Distance Driven' and 'Distance From Home' columns. Is there any way to include them? – 887 Oct 20 '20 at 12:44