0

I have a data.frame ids with an index column of person ID number.

ID
PA1
PA2
PA3
PA4 
PA5 etc

I want to extract the age and sex of each person from 2 other data.frames, df1 and df2which have the details (with different headers), each contains only part of the total sample population, though there maybe some overlap, but also ids contains more rows than both df1 and df2 combines:

df1

ID  Sex  Age  Mode
PA1  M   34   HD
PA3  F   45   PD

and df2

ID  Gender  Age   
PA4  M      67
PA3  F      45
PA2  M      65

In total there are about 1700 rows of data. I want to get the Sex or Gender and Age from df1 and df2 and put them with the correct ID in new columns in ids data.frame, so I get this:

ID   Age   Sex
PA1  34    M
PA2  65    M
PA3 etc...
PA4 
PA5 

df1 and df2 both contain other columns of data which I don't need in ids so I don't think I want to use left_join or merge. I was considering doing something like this with dlplyr:

ids = ids%>%
   mutate("Sex" = case_when(df1$Gender != "" ~df1$Gender, TRUE ~"" ))

and if blank to look in df2 but realise that won't match the values with the correct IDs. The IDs are not currently in the same order in the different data.frames

Mark Davies
  • 787
  • 5
  • 18

1 Answers1

0

You can use left_join and then just select the columns that you need :

library(dplyr)

ids %>%
  left_join(df1 ,by = 'ID') %>%
  left_join(df2, by = 'ID') %>%
  select(ID, Age, Sex)

If there are lot of common columns which are resulting in .x, .y column after joining you can select relevant columns first and then do the join.

ids %>%
  left_join(df1 %>% select(Age, Mode), by = 'ID') %>%
  left_join(df2 %>% select(Sex), by = 'ID')
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • When I do that I end up with two Age coulmns and 2 Sex coulmns: `Age.x` and `Age.y` and `Sex.x` and `Sex.y`. I guess this is because some of the IDs are common to both df1 and df2 – Mark Davies Sep 10 '20 at 10:46
  • I think that is because you have `Age` and `Sex` columns in both `df1` and `df2` ? Maybe you need to add one of them in `select` , `Age.x` or add them in `by` i.e `by = c('ID', 'Sex', 'Age')` ? – Ronak Shah Sep 10 '20 at 10:49
  • Thanks. But I need the Sex and Age data from both `df1` and `df2` because they have different IDs, though some overlap. – Mark Davies Sep 10 '20 at 11:04
  • Could I do left join for one df, and a `for` statement to fill in the gaps? – Mark Davies Sep 10 '20 at 11:04
  • Do you need a `full_join` then so that all the ids are present in the final dataframe? Would it be possible to create a small reproducible example with `ids`, `df1`, and `df2` and show expected output for it so that it is clear what you want to do. – Ronak Shah Sep 10 '20 at 12:09