3

Guys I need to merge different data frames from a list by row and maintain some information contained in the duplicate rows. Each row contains daily observation of some variables (stock prices) and each of the data frames contains different time spans (years). From one data frame to the other some variables could change (columns - stocks inside the index). bind_rows from seems to do a great job at simply adding columns with the new variables and leaving NAs elsewhere.

The point is that some of the data frames contain the last day of the previous period (that is therefore already bind from the previous data frame) but they slightly differ in the variables shown (columns). I don't want to completely eliminate one of the duplicate rows because they both contain information I need and I would rather prefer to merge them. The duplicate rows contain either the same value (because refer to the same day) or one NA and one value (because refer to the different variables in the set). How can I do?

The problem could be synthetized in the following example:

library(dplyr)
df_1 <- data.frame(Date=c(1:4),A=c(20,30,20,30),B=c(15,16,15,16)) 
df_2 <- data.frame(Date=c(4:7),A=c(30,35,60,40),C=c(15,18,25,20))
dfs<-list(df_1,df_2)
bind_rows(dfs)

Outcome:

  Date  A  B  C
1    1 20 15 NA
2    2 30 16 NA
3    3 20 15 NA
4    4 30 16 NA
5    4 30 NA 15
6    5 35 NA 18
7    6 60 NA 25
8    7 40 NA 20

Desired outcome:

  Date  A  B  C
1    1 20 15 NA
2    2 30 16 NA
3    3 20 15 NA
4    4 30 16 15
5    5 35 NA 18
6    6 60 NA 25
7    7 40 NA 20
Mr Frog
  • 296
  • 2
  • 16
  • 1
    Maybe: [How to collapse many records into one while removing NA values](https://stackoverflow.com/questions/28509462/how-to-collapse-many-records-into-one-while-removing-na-values) – Henrik May 24 '20 at 13:12
  • Not really clear the code behind the `setDT(..)` command they provided but it solved my case – Mr Frog May 24 '20 at 14:56
  • You may begin here: [Use of lapply .SD in data.table R](https://stackoverflow.com/questions/32276887/use-of-lapply-sd-in-data-table-r). Then, the stuff after `function(x)` you can try on any simple vector. Or pick the `dplyr` alternative, since you used that tag. – Henrik May 24 '20 at 17:30

1 Answers1

4

Instead of binding rows you can do a full join by Date and A column.

library(dplyr)
full_join(df_1, df_2, by = c('Date', 'A'))
#Thanks to @duckmayr for the suggestion. 

#   A  B  C
#1 20 15 NA
#2 30 16 NA
#3 20 15 NA
#4 30 16 15
#5 35 NA 18
#6 60 NA 25
#7 40 NA 20

which in base R, can be done as :

merge(df_1, df_2, by = c('Date', 'A'), all = TRUE)

If the data is in a list we can use Reduce

purrr::reduce(dfs, full_join, by = c('Date', 'A'))

Or

Reduce(function(x, y) merge(df_1, df_2, by = c('Date', 'A'), all = TRUE), dfs)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213