I have a lot of old R code using the following syntax to perform what I think are left joins (or left outer joins if you prefer the SQL name):
merge(a, b, by="id", all.x=TRUE)
From my point of view, this is completely equivalent to using dplyr's dedicated function:
left_join(a, b, by="id")
I'm wondering if this is always the case or if the two can in some cases lead to different results. Please feel free to provide examples of when they could be considered equivalent and when not.
In this silly example, the two seems to yield the same result
require(dplyr)
a = data.frame(id=1:4, c(letters[1:3], NA)) %>% as_tibble()
b = data.frame(id=1:2) %>% as_tibble()
all_equal(left_join(b, a, by="id"), merge(b, a, by='id', all.x = T))
# TRUE
Why am I asking this question?
I'm asking this because, for instance, stats::aggregate and dplyr::group_by, if used with default arguments are not equivalent:
a %>% group_by(letter) %>% summarise(mean(id))
# # A tibble: 4 x 2
# letter `mean(id)`
# <fct> <dbl>
# 1 a 1.00
# 2 b 2.00
# 3 c 3.00
# 4 <NA> 4.00
aggregate(id ~ letter, data = a, FUN = mean)
# letter id
# 1 a 1
# 2 b 2
# 3 c 3
That, is they give the same result if you omit NAs from the dplyr's data (because the default for aggregate is na.omit). I'm asking also because when working with big datasets it's hard to spot at a glance why something is happening (especially when dealing with some code that was not written by you) and if you have to do some maintenance work, harmless sostitutions like those presented above can cause significant changes in the output.
EDIT: I'm using dplyr 0.7.4 and R 3.4.1.