Binding rows from list with meaningful duplicates in R

Question

Guys I need to merge different data frames from a list by row and maintain some information contained in the duplicate rows. Each row contains daily observation of some variables (stock prices) and each of the data frames contains different time spans (years). From one data frame to the other some variables could change (columns - stocks inside the index). bind_rows from dplyr seems to do a great job at simply adding columns with the new variables and leaving NAs elsewhere.

The point is that some of the data frames contain the last day of the previous period (that is therefore already bind from the previous data frame) but they slightly differ in the variables shown (columns). I don't want to completely eliminate one of the duplicate rows because they both contain information I need and I would rather prefer to merge them. The duplicate rows contain either the same value (because refer to the same day) or one NA and one value (because refer to the different variables in the set). How can I do?

The problem could be synthetized in the following example:

library(dplyr)
df_1 <- data.frame(Date=c(1:4),A=c(20,30,20,30),B=c(15,16,15,16)) 
df_2 <- data.frame(Date=c(4:7),A=c(30,35,60,40),C=c(15,18,25,20))
dfs<-list(df_1,df_2)
bind_rows(dfs)

Outcome:

  Date  A  B  C
1    1 20 15 NA
2    2 30 16 NA
3    3 20 15 NA
4    4 30 16 NA
5    4 30 NA 15
6    5 35 NA 18
7    6 60 NA 25
8    7 40 NA 20

Desired outcome:

  Date  A  B  C
1    1 20 15 NA
2    2 30 16 NA
3    3 20 15 NA
4    4 30 16 15
5    5 35 NA 18
6    6 60 NA 25
7    7 40 NA 20

Maybe: [How to collapse many records into one while removing NA values](https://stackoverflow.com/questions/28509462/how-to-collapse-many-records-into-one-while-removing-na-values) — Henrik, May 24 '20 at 13:12
Not really clear the code behind the `setDT(..)` command they provided but it solved my case — Mr Frog, May 24 '20 at 14:56
You may begin here: [Use of lapply .SD in data.table R](https://stackoverflow.com/questions/32276887/use-of-lapply-sd-in-data-table-r). Then, the stuff after `function(x)` you can try on any simple vector. Or pick the `dplyr` alternative, since you used that tag. — Henrik, May 24 '20 at 17:30

Ronak Shah · Answer 1 · 2020-05-24T12:35:03.347

4

Instead of binding rows you can do a full join by Date and A column.

library(dplyr)
full_join(df_1, df_2, by = c('Date', 'A'))
#Thanks to @duckmayr for the suggestion. 

#   A  B  C
#1 20 15 NA
#2 30 16 NA
#3 20 15 NA
#4 30 16 15
#5 35 NA 18
#6 60 NA 25
#7 40 NA 20

which in base R, can be done as :

merge(df_1, df_2, by = c('Date', 'A'), all = TRUE)

If the data is in a list we can use Reduce

purrr::reduce(dfs, full_join, by = c('Date', 'A'))

Or

Reduce(function(x, y) merge(df_1, df_2, by = c('Date', 'A'), all = TRUE), dfs)

edited May 24 '20 at 12:35

answered May 24 '20 at 12:23

Ronak Shah

377,200
20
156
213

I'm sorry I didn't mention that the data frames are into a list – Mr Frog May 24 '20 at 12:29
1

@MrFrog See [this post](https://stackoverflow.com/a/14096854/8245406). – Rui Barradas May 24 '20 at 12:35
What if the As are hundreds of variables?? – Mr Frog May 24 '20 at 12:38
@MrFrog you can gather columns by position i.e `cols <- names(df)[1:100]` and then use this `cols` in `by` argument. Or if there is some pattern in columns you can also use `grep`. – Ronak Shah May 24 '20 at 13:48

Binding rows from list with meaningful duplicates in R

1 Answers1