0

I have two datasets:

DS1 - contains a list of subjects with a columns for name, ID number and Employment status

DS2 - contains the same list of subjects names and ID numbers but some of these are missing on the second data set.
Finally it contains a 3rd column for Education Level.

I want to merge the Education column onto the first dataset. I have done this using the merge function sorting by ID number but because some of the ID numbers are missing on the second data set I want to merge the remaining Education level by name as a secondary option. Is there a way to do this using dplyr/tidyverse?

NomNonYon
  • 87
  • 6
  • Does not appear to be a duplicate of the flagged question - this question is about joining on different fields depending on whether the ID number is present or not in that row – anotherfred Oct 31 '19 at 16:56
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Oct 31 '19 at 16:58
  • 1
    Does not seem to be duplicate. OP, I believe what you can do is; first ```left_join``` by ID and subsequently ```left_join``` by Name. Now, you will have two 'Education level' columns. Where you find the first column to be NA, you can use data from the second 'Education level' column. – makeshift-programmer Oct 31 '19 at 16:59

2 Answers2

0

There are two ways you can do this. Choose the one based on your preference.

1st option:

#here I left join twice and select columns each time to ensure there is no duplication like '.x' '.y'
finalDf = DS1 %>% 
  dplyr::left_join(DS2 %>% 
                     dplyr::select(ID,EducationLevel1=EducationLevel),by=c('ID')) %>% 
  dplyr::left_join(DS2 %>% 
                     dplyr::select(Name,EducationLevel2=EducationLevel),by=c('Name')) %>% 
  dplyr::mutate(FinalEducationLevel = ifelse(is.na(EducationLevel1),EducationLevel2,EducationLevel1))

2nd option:

#first find the IDs which are present in the 2nd dataset

commonIds = DS1 %>% 
  dplyr::inner_join(DS2 %>% 
                      dplyr::select(ID,EducationLevel),by=c('ID'))

#now the records where ID was not present in DS2

idsNotPresent = DS1 %>% 
  dplyr::filter(!ID %in% commonIds$ID) %>% 
  dplyr::left_join(DS2 %>% 
                     dplyr::select(Name,EducationLevel),by=c('Name'))

#bind these two dfs to get the final df

finalDf = bind_rows(commonIds,idsNotPresent)

Let me know if this works.

  • Thanks for the reply. So the second option did kind of work for but ended up creating duplicate rows and columns. So I ended up having double the amount of rows, all with NA except for the Education column. What I actually need is to create a function I can apply to other datasets and I'm thinking there might be way with if/else statements. I'm just not too familiar with them in R. I will take the time later to make a mock up of my tables and output and hopefully that will make things easier – NomNonYon Nov 01 '19 at 11:47
0

The second option in makeshift-programmer's answer worker for me. Thank you so much. Had to play around with it for my actual data sets but the basic structure worked very well and it was easy to adapt

NomNonYon
  • 87
  • 6