
I have a lot of old R code using the following syntax to perform what I think are left joins (or left outer joins if you prefer the SQL name):

merge(a, b, by="id", all.x=TRUE)

From my point of view, this is completely equivalent to using dplyr's dedicated function:

left_join(a, b, by="id")

I'm wondering if this is always the case or if the two can in some cases lead to different results. Please feel free to provide examples of when they could be considered equivalent and when not.

In this silly example, the two seem to yield the same result:

require(dplyr)

a = data.frame(id = 1:4, letter = c(letters[1:3], NA)) %>% as_tibble()
b = data.frame(id=1:2) %>% as_tibble()

all_equal(left_join(b, a, by = "id"), merge(b, a, by = "id", all.x = TRUE))
# TRUE

Why am I asking this question?

I'm asking this because, for instance, stats::aggregate and dplyr::group_by, if used with default arguments, are not equivalent:

a %>% group_by(letter) %>% summarise(mean(id))

# # A tibble: 4 x 2
# letter `mean(id)`
# <fct>       <dbl>
# 1 a            1.00
# 2 b            2.00
# 3 c            3.00
# 4 <NA>         4.00

aggregate(id ~ letter, data = a, FUN = mean)

# letter id
# 1      a  1
# 2      b  2
# 3      c  3 

That is, they give the same result only if you omit NAs from the dplyr data (because the default na.action for aggregate is na.omit). I'm also asking because, when working with big datasets, it's hard to spot at a glance why something is happening (especially when dealing with code that was not written by you), and if you have to do some maintenance work, seemingly harmless substitutions like those presented above can cause significant changes in the output.

EDIT: I'm using dplyr 0.7.4 and R 3.4.1.

mickkk
  • Note that `merge` may reorder the data. See `?merge`: "The rows are by default lexicographically sorted on the common columns, but for `sort = FALSE` are in an **unspecified order**." See e.g. [Merge two data frames while keeping the original row order](https://stackoverflow.com/questions/17878048/merge-two-data-frames-while-keeping-the-original-row-order) – Henrik Feb 26 '18 at 15:12
  • Interesting. This of course does not show up when checking for equality using `all_equal` with default arguments. – mickkk Feb 26 '18 at 15:45
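
A minimal sketch of the reordering Henrik describes above, assuming the `a` from the question plus an unsorted lookup table (here called `b2`, a name introduced purely for illustration):

library(dplyr)

a  = data.frame(id = 1:4, letter = c(letters[1:3], NA)) %>% as_tibble()
b2 = data.frame(id = c(2, 1)) # join keys deliberately out of order

merge(b2, a, by = "id", all.x = TRUE) # rows come back sorted by "id": 1, then 2
left_join(b2, a, by = "id")           # rows keep b2's original order: 2, then 1

# all_equal() ignores row order by default, so the equality check still reports TRUE
all_equal(left_join(b2, a, by = "id"), merge(b2, a, by = "id", all.x = TRUE))
# TRUE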

1 Answer


The tidyverse functions treat NA as part of the data, because missing values can carry information that the observed ("identified") values cannot. In other words, you must use a specific function to drop NA values if you want them excluded. In your example there are several ways to perform the same process with equivalent results. For example, consider the na.omit() function:

library(tidyverse) # this includes the dplyr package

a = data.frame(id = 1:4, 
               letter = c(letters[1:3], NA)
               ) %>% as_tibble()

all.equal(
  a %>% 
    na.omit() %>% # na.omit() drops every row containing an NA; here only "letter" has NAs
    group_by(letter) %>% 
    summarise(id = mean(id)),

  aggregate(id ~ letter, 
            data = a, 
            FUN = mean ))
#>[1] TRUE

Another example is using the filter() function:

all.equal(
  a %>% 
    filter(!is.na(letter)) %>% # this drops rows with NA in the column "letter"
    group_by(letter) %>% 
    summarise(id = mean(id)),

  aggregate(id ~ letter, 
            data = a, 
            FUN = mean ))
#>[1] TRUE
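
For completeness, a third equivalent route (a sketch under the same setup) is tidyr's drop_na(), which library(tidyverse) already attaches:

all.equal(
  a %>% 
    drop_na(letter) %>% # tidyr::drop_na() removes rows with NA in "letter"
    group_by(letter) %>% 
    summarise(id = mean(id)),

  aggregate(id ~ letter, 
            data = a, 
            FUN = mean ))
#>[1] TRUE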

Hope this can help you!

gavg712