What is the best way of "fusioning" two data frames?

Question

I have two data frames (df_a and df_b) both containing a number of columns and patient IDs. The columns contain information about particular diagnosis (TRUE or FALSE). My task is to combine the two data frames so that the value is TRUE if the value is TRUE in any of the data frames.

In reality there is a different number of rows in df_a and df_b. The matching should be done using the id. Both data frames have the same set of columns.

df_a <- data.frame(id = 1:10,
                 dg_a = c(T, T, T, F, F, F, T, T, F, T), 
                 dg_b = c(F, F, F, F, T, T, F, T, T, F))

df_b <- data.frame(id = 1:10, 
                 dg_a = c(F, F, F, T, F, F, F, T, T, T), 
                 dg_b = c(F, T, T, F, F, T, F, T, F, F))

I.e. after combining data frames df_a and df_b I should get df_c.

>df_c
   id  dg_a  dg_b
1   1  TRUE FALSE
2   2  TRUE  TRUE
3   3  TRUE  TRUE
4   4  TRUE FALSE
5   5 FALSE  TRUE
6   6 FALSE  TRUE
7   7  TRUE FALSE
8   8  TRUE  TRUE
9   9  TRUE  TRUE
10 10  TRUE FALSE

What is the best way of doing this in R? I have tried different types of joins but I can't quite figure it out.

@MauritsEvers I agree the solution is a variant of that and can be considered a duplicate of the link you provided. Thanks for catching that. — steveb, Feb 12 '18 at 06:01
I think this is a good question. I have added a generic/flexible answer in the [linked](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right/48754261#48754261) question. You can have look. It may help you in cases where you got many columns. I had to add answer to the linked question as this question is marked as duplicate. — MKR, Feb 12 '18 at 19:43
I think this is different enough to stand on it's own, perhaps with a note to the more basic/general question. @MKR, I'd suggest moving your answer to this question, as it does not really answer the question you posted it at. — Gregor Thomas, Feb 12 '18 at 22:15

score 2 · Answer 1 · answered Feb 12 '18 at 05:26

2

You can simply use the | operation:

df_c <- data.frame(id = 1:10,
                   dg_a = df_a$dg_a | df_b$dg_a,
                   dg_b = df_a$dg_b | df_b$dg_b)

answered Feb 12 '18 at 05:26

TYZ

8,466
5
29
60

What if I have different number of rows in the data frames (and not all the IDs are the same)? – JuusoT Feb 12 '18 at 05:34
@JuusoT You can try my solution (provided as answer in to your question). That solution should be able to handle situations where number of rows are different, IDs are different or even columns are different in two dataframes. – MKR Feb 13 '18 at 17:45

steveb · Answer 2 · 2018-02-12T05:39:56.907

2

Here is a dplyr solution which takes into consideration that you may have a different number of rows in both data.frames:

library(dplyr)

full_join(df_a, df_b, by = "id") %>%
    transmute(id,
              dg_a = dg_a.x | dg_a.y,
              dg_b = dg_b.x | dg_b.y)

This will include rows from both data.frames even if there isn't a match; see the other dplyr join functions for other behaviors.

edited Feb 12 '18 at 05:39

answered Feb 12 '18 at 05:33

steveb

5,382
2
27
36

score 1 · Accepted Answer · answered Feb 12 '18 at 22:32

A very generic solution can be achieved using dplyr and tidyr packages.

The approach: Since number of columns can be many, hence I think it would be better to first use gather to create dataframes in form of id, key and val. Then use bind_rows to merger dataframes and apply logic derive value(i.e. even one value is TRUE then combined value will be TRUE. Finally use spread from tidyr package to return dataframes in original format.

df_a <- data.frame(id = 1:10,
                   dg_a = c(T, T, T, F, F, F, T, T, F, T), 
                   dg_b = c(F, F, F, F, T, T, F, T, T, F))

df_b <- data.frame(id = 1:10, 
                   dg_a = c(F, F, F, T, F, F, F, T, T, T), 
                   dg_b = c(F, T, T, F, F, T, F, T, F, F))


library(dplyr)
library(tidyr)

df_a %>% gather(key = "key", value = "val", -id) %>% 
         bind_rows(gather(df_b, key = "key", value = "val", -id )) %>%
         group_by(id, key) %>%
         summarise(val = ifelse(sum(val == TRUE) > 0, TRUE, FALSE)) %>%
         spread(key, val) %>% as.data.frame()

# Result  

#    id  dg_a  dg_b
# 1   1  TRUE FALSE
# 2   2  TRUE  TRUE
# 3   3  TRUE  TRUE
# 4   4  TRUE FALSE
# 5   5 FALSE  TRUE
# 6   6 FALSE  TRUE
# 7   7  TRUE FALSE
# 8   8  TRUE  TRUE
# 9   9  TRUE  TRUE
# 10 10  TRUE FALSE

What is the best way of "fusioning" two data frames?

3 Answers3