
I have two data frames with shared columns (the same column names and types). Both data frames have the same number of rows (200K) and the same number of columns (172). For brevity's sake, I'll represent them as follows:

df1:

ID COL1  COL2
1   X
2         X
3   X
4         X
5   X

df2:

ID COL1  COL2
1         Y
2   Y     
3   X
4   Y      
5         Y

I would like to obtain a third data frame that keeps, by ID, ALL relevant information from either of the two data frames (meaning both common AND uncommon values). For this example, the result should be:

df_result:

ID COL1  COL2
1   X     Y
2   Y     X
3   X
4   Y     X
5   X     Y

So far I've separated the common and uncommon values using dplyr::anti_join() and dplyr::semi_join(), but with this approach it's complicated to later aggregate them into the result data frame, since there are too many columns to compare (the join results are spread across column.x and column.y pairs). Any help is appreciated.
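To illustrate the problem (a minimal sketch, assuming the blank cells above are NA): joining on ID alone keeps the row count at 5, but every shared column comes back twice under .x/.y suffixes, which is what makes the cleanup painful with 172 columns:

library(dplyr)

# Joining on ID alone: the row count stays at 5, but each shared
# column is duplicated with .x (from df1) and .y (from df2) suffixes
full_join(df1, df2, by = "ID")
#   ID COL1.x COL2.x COL1.y COL2.y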

Aaron Parrilla
  • Is `full_join()` not doing what you want? Nor `merge(..., all = TRUE)`? – tom Jul 31 '19 at 11:53
  • @tdel No, because I can't add to the total number of rows; `full_join()` returns common and uncommon values but also adds rows to the total when the columns don't match. – Aaron Parrilla Jul 31 '19 at 11:55
  • https://stackoverflow.com/a/51332604/10580543 – the answer by @thc might interest you; it shows how to deal with duplicated rows when columns don't match. – tom Jul 31 '19 at 12:04

3 Answers


Assuming that DF2 wins out wherever it has non-NA values, this solution would work:

DF3 <- DF1
# Matrix indexing: wherever DF2 holds a non-NA value, overwrite DF3 with it
DF3[!is.na(DF2)] <- DF2[!is.na(DF2)]

DF3

  ID col1 col2
1  1    x    y
2  2    y    x
3  3    x <NA>
4  4    y    x
5  5    x    y
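Note that the matrix-indexing trick assumes DF1 and DF2 have identical dimensions and the same row order; a cheap sanity check (not strictly part of the solution) would be:

# Stop early if the two data.frames aren't aligned row-for-row
stopifnot(identical(dim(DF1), dim(DF2)), identical(DF1$ID, DF2$ID))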

If you start with a tibble or data.table, you could convert it to a data.frame first:

DF3 <- as.data.frame(DF1)
DF3[!is.na(DF2)] <- as.data.frame(DF2)[!is.na(DF2)]

Two base options which would work with tibbles out of the box are:

DF3 <- DF1
DF3[, names(DF3)[-1]] <- mapply(function(x,y) ifelse(is.na(x), y,x), DF1[, -1], DF2[, -1])
DF3[, names(DF3)[-1]] <- lapply(seq_len(length(DF1))[-1], function(i) ifelse(is.na(DF1[[i]]), DF2[[i]], DF1[[i]]))

# If you don't mind packages, dplyr::coalesce() is faster and reads better:
library(dplyr)
DF3[, names(DF3)[-1]] <- mapply(coalesce, DF1[, -1], DF2[, -1])
DF3[, names(DF3)[-1]] <- lapply(seq_len(length(DF1))[-1], function(i) coalesce(DF1[[i]], DF2[[i]]))

# A tibble: 5 x 3
     ID col1  col2 
  <int> <chr> <chr>
1     1 x     y    
2     2 y     x    
3     3 x     NA   
4     4 y     x    
5     5 x     y   

For a more robust option that can handle data.frames with differing rows, we can use a data.table update join:

library(data.table)
dt_1 <- as.data.table(DF1)
dt_2 <- as.data.table(DF2)

cols = names(dt_1)[-1]

dt_1[dt_2
     , on = 'ID'
     , (cols) := lapply(seq_along(cols), function(i) coalesce(get(cols[i]), get(paste0('i.', cols[i]))))
     ]

dt_1

   ID col1 col2
1:  1    x    y
2:  2    y    x
3:  3    x <NA>
4:  4    y    x
5:  5    x    y
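If your data.table is new enough, fcoalesce() drops the dplyr dependency; a sketch under that assumption (fcoalesce() was added in data.table 1.12.4):

# Same update join as above, with data.table's own fcoalesce()
# (requires data.table >= 1.12.4)
dt_1[dt_2
     , on = 'ID'
     , (cols) := lapply(seq_along(cols), function(i)
         fcoalesce(get(cols[i]), get(paste0('i.', cols[i]))))
     ]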

Performance

None of my proposed base options involve grouping, which should allow for better vectorization.

# Data repeated to have 50,000 rows

Unit: milliseconds
             expr        min         lq        mean     median         uq        max neval
     cole_base_df    46.1678    46.6577    47.79072    46.7874    47.8612    51.4795     5
 cole_base_mapply    36.3574    38.0716    40.42820    39.5467    40.1889    47.9764     5
 cole_base_lapply    27.3791    30.1052    31.30574    31.2388    33.0415    34.7641     5
# lapply with coalesce
 cole_base_lapply     2.2017     2.2226     2.68914     2.2928     2.4140     4.3146     5
          cole_dt    11.6885    12.2909    12.41180    12.5288    12.7141    12.8367     5
     andrew_dplyr  7287.7865  7513.3745  7545.59520  7576.0932  7655.2974  7695.4244     5
        andrew_dt   624.4604   647.1066   674.93512   689.3315   698.1462   715.6309     5
      ronak_dplyr  9660.8393  9779.2466 10071.20714 10156.6727 10286.6954 10472.5817     5
       ronak_base 10399.2761 10526.9840 10613.55536 10691.6657 10723.1021 10726.7489     5

# Data repeated to have 500 rows

Unit: microseconds
             expr     min       lq       mean    median        uq      max neval
     cole_base_df   570.1   674.70    719.660    706.75    726.95   2736.2   100
 cole_base_mapply   580.8   640.75    696.913    671.35    695.75   2689.5   100
 cole_base_lapply   424.8   460.40    517.155    492.85    518.90   3220.3   100
          cole_dt  2645.5  3000.55   3120.355   3093.35   3167.45   5958.5   100
     andrew_dplyr 73523.4 76009.45  78125.912  77151.25  78673.85 125830.3   100
        andrew_dt  6777.3  7195.80   7644.179   7318.45   7579.15  11365.1   100
      ronak_dplyr 94523.3 99039.50 102829.575 100026.15 101643.70 169167.3   100
       ronak_base 93602.2 96086.30  97806.927  97470.05  98376.60 123348.8   100 

Data:

DF1 <- data.frame(ID = seq_len(5)
                  ,col1 = c('x', NA_character_, 'x', NA_character_, 'x')
                  ,col2 = c(NA_character_, 'x', NA_character_, 'x', NA_character_)
                  , stringsAsFactors = F)

DF2 <- data.frame(ID = seq_len(5)
                  ,col1 = c(NA_character_, 'y', 'x', 'y', NA_character_)
                  ,col2 = c('y', NA_character_, NA_character_, NA_character_, 'y')
                  , stringsAsFactors = F)

Complete code for folks to do their own benchmarks:

library(microbenchmark)
library(dplyr)
library(data.table)

DF1 <- data.frame(ID = seq_len(5)
                  ,col1 = c('x', NA_character_, 'x', NA_character_, 'x')
                  ,col2 = c(NA_character_, 'x', NA_character_, 'x', NA_character_)
                  , stringsAsFactors = F
)

DF2 <- data.frame(ID = seq_len(5)
                  ,col1 = c(NA_character_, 'y', 'x', 'y', NA_character_)
                  ,col2 = c('y', NA_character_, NA_character_, NA_character_, 'y')
                  , stringsAsFactors = F
)

n_rep <- 100 # change to 10000 if you want 50,000 rows

DF1 <- do.call(rbind, replicate(n_rep, DF1, simplify = F))
DF1$ID <- seq_len(nrow(DF1))
DF2 <- do.call(rbind, replicate(n_rep, DF2, simplify = F))
DF2$ID <- seq_len(nrow(DF2))

dt_1 <- as.data.table(DF1)
dt_2 <- as.data.table(DF2)

microbenchmark(
  cole_base_df = {
    DF3 <- DF1
    DF3[!is.na(DF2)] <- DF2[!is.na(DF2)]
  }
  ,cole_base_mapply = {
    DF3 <- DF1
    DF3[, names(DF3)[-1]] <- mapply(function(x,y) ifelse(is.na(x), y,x), DF1[, -1], DF2[, -1])
    # or better
    # DF3[, names(DF3)[-1]] <- mapply(dplyr::coalesce, DF1[, -1], DF2[, -1])
  }
  ,cole_base_lapply = {
    DF3 <- DF1
    DF3[, names(DF3)[-1]] <- lapply(seq_len(length(DF1))[-1], function(i) ifelse(is.na(DF1[[i]]), DF2[[i]], DF1[[i]]))
    # or better
    # DF3[, names(DF3)[-1]] <- lapply(seq_len(length(DF1))[-1], function(i) dplyr::coalesce(DF1[[i]], DF2[[i]]))
  }
  ,cole_dt = {
    cols = names(dt_1)[-1]

    copy(dt_1)[copy(dt_2)
               , on = 'ID'
               , (cols) := lapply(seq_along(cols), function(i) coalesce(get(cols[i]), get(paste0('i.', cols[i]))))
               ][]
  }
  , andrew_dplyr = {
    dplyr::union(DF1, DF2) %>%
      group_by(ID) %>%
      mutate_at(vars(starts_with("col")), ~ifelse(any(!is.na(.)), .[!is.na(.)], .)) %>%
      distinct
  }
  , andrew_dt = {
    rbindlist(list(DF1, DF2))[, lapply(.SD, function(x) ifelse(any(!is.na(x)), x[!is.na(x)], x)), by = "ID"]
  }
  , ronak_dplyr = {
    bind_rows(DF1, DF2) %>%
      group_by(ID) %>%
      summarise_at(vars(starts_with("col")), ~toString(na.omit(unique(.))))
  }
  , ronak_base = {
    aggregate(.~ID, rbind(DF1, DF2), 
              function(x) toString(na.omit(unique(x))), na.action = "na.pass")
  }
  , times = 5
)
Cole
  • I can't do it the data.table way because I have 172 columns, and your first solution returns an error for me: `Unsupported matrix index in replacement` – Aaron Parrilla Jul 31 '19 at 12:31
  • See edits. I think the error is a good reason for `dput`, or at least for producing output in copy-and-paste format. A `tibble` doesn't allow matrix indexing, unfortunately. Using `as.data.frame()` would make the first method work, but I added more options that work without coercing to a `data.frame`. I also made the `data.table` option more robust. – Cole Aug 01 '19 at 02:58

Here is a dplyr solution that should be flexible (note: it is probably more efficient to use summarise_at like Ronak Shah does, rather than mutate_at + distinct):

library(dplyr)

dplyr::union(df1, df2) %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with("COL")), ~ifelse(any(!is.na(.)), .[!is.na(.)], .)) %>%
  distinct

  ID    COL1  COL2 
  <chr> <chr> <chr>
1 1     X     Y    
2 2     Y     X    
3 3     X     NA   
4 4     Y     X    
5 5     X     Y  

Or, using the same logic on a data.table:

library(data.table)

setDT(rbind(df1, df2))[, lapply(.SD, function(x) ifelse(any(!is.na(x)), x[!is.na(x)], x)), by = "ID"]

Data (note: I added NA where you had blank cells):

df1 <- read.table(header = T, text = "ID COL1  COL2
           1   X NA
           2  NA       X
           3   X NA
           4  NA       X
           5   X NA")

df2 <- read.table(header = T, text = "ID COL1  COL2
1  NA       Y
2   Y NA     
3   X NA
4   Y NA     
5   NA      Y")
Andrew

Here is another version using dplyr: bind the two data frames together, group_by ID, and paste all unique values together.

library(dplyr)

bind_rows(df1, df2) %>%
   group_by(ID) %>%
   summarise_at(vars(starts_with("COL")), ~toString(na.omit(unique(.))))

#  ID    COL1  COL2 
#  <chr> <chr> <chr>
#1 1     X     Y    
#2 2     Y     X    
#3 3     X     ""   
#4 4     Y     X    
#5 5     X     Y    

And a similar approach using aggregate in base R:

aggregate(.~ID, rbind(df1, df2), 
         function(x) toString(na.omit(unique(x))), na.action = "na.pass")
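
Note that toString() turns an all-NA group into an empty string (see COL2 for ID 3 above). If you'd rather have NA back, one option is an extra cleanup step with dplyr::na_if() (a sketch; `res` is a stand-in name for the summarised data frame from either snippet above):

library(dplyr)

# Turn the "" produced by toString() back into real NA values
res %>%
   mutate_at(vars(starts_with("COL")), ~na_if(., ""))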
Ronak Shah