0

I want to create a new variable (Sex) in a dataframe by matching up the ID's between my main data frame and a reference data frame which contains information about the sex of each individual.

I have the following code that works - but as my data frame is over 6 million rows long it is taking over 10 hours to run.

data.df$Sex_Varb <- NA

for(i in 1:nrow(data.df)){
  find.match <- which(data.df$ID_Varb[i] == Reference_Dat$ID_Varb)
  if(length(find.match) != 0){
    data.df$Sex_Varb[i] <- Reference_Dat$Sex[find.match]
  }
  
}

Is there a faster way to create a new variable based on the matching values between two datasets?

alex
  • 153
  • 1
  • 10
  • `merge(data.df, Reference_Dat, by = "ID_Varb")`. – Gregor Thomas Sep 07 '22 at 15:13
  • Or, since you only want a single column you can use `match`: `data.df$Sex_Varb = Reference_Data$Sex[match(data.df$ID_Varb, Reference_Dat$ID_Varb)]`. But the `merge` or `left_join` approach is nicely general in case you have more columns you want to pull over. – Gregor Thomas Sep 07 '22 at 15:27

1 Answers1

0

Perhaps you are looking for a left join?

merge(data.df, Reference_dat[, c("ID_Varb", "Sex")], by="ID_Varb", all.x=TRUE)
langtang
  • 22,248
  • 1
  • 12
  • 27