1

I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 datframes are matched by a primary key (that is 'pid'). dfnew is a subset of dfold, so that all variables in dfnew are also on dfold but with updated, imputed values (no NAs anymore). At the same time dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes in dfmerge so to update common variables from dfnew --> dfold but at the same time retaining pre-existing variables in dfold. I have tried merge(), match(), dplyr, and sqldf packages, but either I obtain a dfmerge with the updated 75 variables only (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but pretty long (10 rows) loop that is eliminating *.x variables after a merge by pid with all.x = TRUE option). Might you please advice on a more efficient way to obtain such result if available ?

Thank you in advance

P.S: To make things easier, I have created a minimal version of dfold and dfnew: dfnew has now 3 variables, no NAs, while dfold has 5 variables, NAs included. Here it is the dataframes structure

dfold:

structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France", 
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L, 
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L, 
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"), 
    pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased", 
"pid"), row.names = c(NA, 5L), class = "data.frame")

dfnew:

structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000, 
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA, 
3L), class = "data.frame")

Although here the issue is limited to just 2 variables Please remind that the real scenario will involve 75 variables.

Diego
  • 127
  • 2
  • 13
  • Could you provide some meaningful sample data? Maybe 10 rows of each dataframe with 5 common variables and additional 5 others in the dfold-subsample? – LAP Oct 18 '17 at 07:33
  • would definitely be happy to do, not sure how to attach data here though, any hint ? – Diego Oct 18 '17 at 07:46
  • The best way would be to use the `dput()` command on a subset as explained above, something like `dput(dfold[1:10, 1:10])` and `dput(dfnew[1:10, 1:5])` (you'd probably need to change the column indices to get 5 common and 5 uncommon variables). Then paste the output from the `R` console (usually starts with `structure(...`) into your question text. – LAP Oct 18 '17 at 07:52
  • Ok edited in the original post, thanks : ) I have just made things even easier but dfold and dfnew should still be valid – Diego Oct 18 '17 at 08:08
  • I'll just recreate it myself, but optimally the `dfnew` should include all 5 rows, especially the one with the `NA` value in the column `Salary`. – LAP Oct 18 '17 at 08:10
  • Do you only want to update `NA` values in `dfold` columns with imputed values from `dfnew`? If so, you could do it without a merge. – LAP Oct 18 '17 at 08:20
  • Yup ! This is exactly what I want, updating dfold with imputed values from dfnew, at the same time retaining old untouched variables in dfold – Diego Oct 18 '17 at 08:21

1 Answers1

1

Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.

> dfold
  Country Age Salary Purchased pid
1  France  NA  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40     NA       Yes   5

> dfnew
  Age Salary pid
1  44  72000   1
2  27  48000   2
3  30  54000   3
4  38  61000   4
5  40  70000   5

To do this for a single column, try

dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)

> dfold
  Country Age Salary Purchased pid
1  France  NA  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40  70000       Yes   5

Using it on the whole dataset was a bit trickier:

First define all common colnames except pid:

cols <- names(dfnew)[names(dfnew) != "pid"]

> cols
[1] "Age"    "Salary"

Now use mapply to replace the NA values with ifelse:

dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])

> dfold
  Country Age Salary Purchased pid
1  France  44  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40  70000       Yes   5

This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use

cols <- names(dfnew)[which(names(dfnew) %in% names(dfold))][names(dfnew) != "pid"]
LAP
  • 6,605
  • 2
  • 15
  • 28
  • 1
    Great and Elegant solution LAP, just tested on the whole real-life dataset and worked like a champ : ) – Diego Oct 18 '17 at 09:31
  • Just to close the topic appropriately and for future reference: in your opinion the same operation cannot be performed with merge or a JOIN query ? – Diego Oct 18 '17 at 09:32
  • After referring to different questions on Stackoverflow, I am quite certain that this cannot be done with a simple merge or join. You'd either need to define the cells to be overwritten beforehand, or delete the `.x`/`.y` variables with are unnecessarily created afterwards. See [this thread](https://stackoverflow.com/questions/16042380/merge-data-frames-and-overwrite-values) for reference. – LAP Oct 18 '17 at 09:36
  • Perfect, in facts the loop I had referred to in the original post is just about that: deleting the .x (unnecessarily) created after merging. Again, thanks a lot for the time you have spent and for your support. – Diego Oct 18 '17 at 09:40