21

What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?

A common answer, such as in this thread is to do the left outer join with either dplyr::left_join or data.table::merge or data.table's dt2[dt1] keyed column bracket syntax, followed by a second step simply replacing all NA values by 0 in the joined data table. For example:

library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
merged_tables <- dt2[dt1];
merged_tables[is.na(merged_tables)] <- 0;

This approach necessarily assumes that there are no valid NA values in dt1 that need to be preserved. Yet, as you can see in the above example, the results are:

   x new_col y
1: a       1 0
2: b       2 w
3: c       3 0
4: d       0 y
5: e       0 z

but the desired results are:

   x new_col y
1: a       1 NA
2: b       2 w
3: c       3 NA
4: d       0 y
5: e       0 z

In such a trivial case, instead of using the data.table all elements replace syntax as above, just the NA values in new_col could be replaced:

library(dplyr);
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col));

However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.

There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join, data.table::merge, or data.table's bracket easily allowed the user to specify a fill value other than NA. Something like:

merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0);

data.table's dcast function allows the user to specify fill value, so I figure there must be an easier way to do this that I'm just not thinking of.

Suggestions?

EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table GitHug page to do exactly what I just mentioned, updating the nomatch=0 syntax. Should be in the next release of data.table.

Community
  • 1
  • 1
Mekki MacAulay
  • 1,727
  • 2
  • 12
  • 23
  • 1
    At the end of your `merge` or outer join using `[..., nomatch=NA]` chain data.table query `merge()[is.na(col), col := 0]`. There is an open FR so `nomatch` arg could handle arbitrary values, currently for outer join it can only use `NA`. – jangorecki Feb 03 '16 at 20:25
  • I'm sorry, but I'm having trouble understanding your answer. Where does `col` come from? Glad to hear there's an open feature request. I'll add my +1 to it. – Mekki MacAulay Feb 03 '16 at 20:36
  • `col` is just a column on which join is being made – jangorecki Feb 03 '16 at 21:15
  • On which the join is being made? Wouldn't that be the key, `x` in this case? What you are suggesting works if `col` is specified as the name of the new column added, in this case `new_col`, but unless I'm misunderstanding, you would still have to chain this replacement for each new column added manually, much like the `dplyr::mutate` in my example, which isn't practical for hundreds of new columns merged. Am I misunderstanding? – Mekki MacAulay Feb 03 '16 at 21:30
  • Nope, the `col` as arbitrary column name on which join will be made – jangorecki Feb 03 '16 at 23:34
  • I apologize again for not being fluent enough in `data.table` syntax to understand what you are proposing (I'm working on it ! :) ). Is replacing the last two lines in my example with: `merged_tables <- dt2[dt1, nomatch=NA][is.na(col), col := 0];` what you had in mind? If so, I get `"Warning message: In is.na(col) : is.na() applied to non-(list or vector) of type 'closure'"` and none of the `NA` values are set to `0`. Could I trouble you to clarify once more please? – Mekki MacAulay Feb 04 '16 at 00:03
  • 1
    Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/102537/discussion-between-jangorecki-and-mekki-macaulay). – jangorecki Feb 04 '16 at 00:21

3 Answers3

9

I stumbled on the same problem with dplyr and wrote a small function that solved my problem. (the solution requires tidyr and dplyr)

left_join0 <- function(x, y, fill = 0L, ...){
  z <- left_join(x, y, ...)
  new_cols <- setdiff(names(z), names(x))
  z <- replace_na(z, setNames(as.list(rep(fill, length(new_cols))), new_cols))
  z
}
Fernando Macedo
  • 355
  • 3
  • 7
  • This is a wonderfully simple answer and exactly what I needed. Thanks! – Dave Kincaid Jun 20 '18 at 17:27
  • Thanks, this was a big help :) One small improvement - if you put `...` as an additional argument to the function definition and the `left_join` call, then the caller will be able to use other parameters like `by=`. – rescdsk Oct 28 '18 at 00:47
2

Could you use column indices to refer only to the new columns, as with left_join they'll all be on the right of the resulting data.frame? Here it would be in dplyr:

dt1 <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
                  y = c(NA, 'w', NA, 'y', 'z'),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(x = c('a', 'b', 'c'),
                  new_col = c(1,2,3),
                  stringsAsFactors = FALSE)

merged <- left_join(dt1, dt2)
index_new_col <- (ncol(dt1) + 1):ncol(merged)
merged[, index_new_col][is.na(merged[, index_new_col])] <- 0

> merged
  x    y new_col
1 a <NA>       1
2 b    w       2
3 c <NA>       3
4 d    y       0
5 e    z       0
Sam Firke
  • 21,571
  • 9
  • 87
  • 105
  • Is it guaranteed that they're going to be on the right in the resulting data.frame if it's a data.table when using `dplyr::left_join`? Note that in the `data.table` example, the columns were inserted to the right of the key column, not to the right of all existing `x` columns. – Mekki MacAulay Feb 03 '16 at 20:57
  • 1
    `left_join` reliably places the columns from the right / second table on the right. I don't know how`data.table` orders columns when using its merge function. I did this example w/ `data.frame` objects because the data.table package was overloading the `[` in my last line, but I get the same column order when I `left_join` the two `data.table`s. Besides unloading data.table to make the last line run in my example, the objects can be coerced with `as_data_frame` before the join - or someone who knows `data.table` could adapt that last line to work as it does in base R. – Sam Firke Feb 03 '16 at 21:11
  • Thanks for your answer. I'm working on very large datasets so I'm a bit concerned about switching between data.table & data.frame each time I do a merge (I do many). I'll do some performance testing to see if it's viable. Your answer may well be the best one until `data.table` v1.9.8 is released with the new options for `nomatch`. – Mekki MacAulay Feb 04 '16 at 00:06
2

The cleanest way at present may simply be to seed an intermediary table with the values to be joined on in the left table (dt1), chain a merge of dt2, set NA values to 0, merge intermediary table with dt1. Can be done entirely with data.table and doesn't depend on data.frame syntax, and the intermediary step ensures that there will be no nomatch NA results in the second merge:

library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
inter_table <- dt2[dt1[, list(x)]];
inter_table[is.na(inter_table)] <- 0;
setkey(inter_table, x);
merged <- inter_table[dt1];

> merged;
   x new_col  y
1: a       1 NA
2: b       2  w
3: c       3 NA
4: d       0  y
5: e       0  z

The benefit of this approach is that it doesn't depend on new columns being added on the right and stays inside data.table keyed speed optimizations. Crediting answer to @SamFirke because his solution also works and may be more useful in other contexts.

Mekki MacAulay
  • 1,727
  • 2
  • 12
  • 23