I ran into something today when using . and %>% which I don't quite understand. Now I am not sure if I understand either operator.

Data

library(data.table)  # setDT() and the df[i, j, by] syntax
library(magrittr)    # %>%
set.seed(1)
df <- setDT(data.frame(id = sample(1:5, 10, replace = TRUE), value = runif(10)))

Why are these three equivalent

df[, .(Mean = mean(value)), by = .(id)] %>% .$Mean %>% sum()
[1] 3.529399
df[, .(Mean = mean(value)), by = .(id)] %>% {sum(.$Mean)}
[1] 3.529399
sum(df[, .(Mean = mean(value)), by = .(id)]$Mean)
[1] 3.529399

But why is this answer so different?

df[, .(Mean = mean(value)), by = .(id)] %>% sum(.$Mean)
[1] 22.0588

Could someone explain how the pipe operator actually works with respect to . usage? I used to think of it as "go fetch whatever sits on the left of the %>%".

Investigation that left me more confused

I tried replacing the sum with print to see what was actually going on

# As Expected
df[, .(Mean = mean(value)), by = .(id)] %>% .$Mean %>% print()
[1] 0.5111589 0.7698414 0.7475319 0.9919061 0.5089610
df[, .(Mean = mean(value)), by = .(id)] %>% print(.$Mean) %>% sum()
[1] 3.529399

# Surprised
df[, .(Mean = mean(value)), by = .(id)] %>% print(.$Mean)
    id      Mean
 1:  1 0.5111589
---             
 5:  3 0.5089610

# Same
df[, .(Mean = mean(value)), by = .(id)] %>% sum(print(.$Mean))
[1] 22.0588

# Utterly Confused
df[, .(Mean = mean(value)), by = .(id)] %>% print(.$Mean) %>% sum()
[1] 18.5294 #Not even the same as above??

Edit: Looks like nothing to do with data.table or how it was grouped, same issue with data.frame:

x <- data.frame(x1 = 1:3, x2 = 4:6)

sum(x$x1)
# [1] 6
sum(x$x2)
# [1] 15

x %>% .$x1 %>% sum
# [1] 6
x %>% .$x2 %>% sum
# [1] 15

# Why?
x %>% sum(.$x1)
# [1] 27
x %>% sum(.$x2)
# [1] 36
zx8754
Croote
  • The value of `18.5294` is easier to explain than `22.0588`: it is the result of summing the complete dataset, see: `df[, .(Mean = mean(value)), by = .(id)] %>% unlist() %>% sum()` – Jaap Mar 16 '20 at 07:41
  • In the second example, `x %>% pull(x1) %>% sum()` also gives the expected result – Jaap Mar 16 '20 at 07:51
  • @Jaap Yeah, pull or pre-selecting before sum works fine. – zx8754 Mar 16 '20 at 07:52

1 Answer

The updated short example helps.

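With %>%, the LHS is passed as the first argument of the RHS call unless . appears as a direct argument of that call or the RHS is wrapped in {}. Here . appears only inside the nested expression .$x1, so x is still inserted as the first argument. That means: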

x %>% sum(.$x1)
#[1] 27

is equivalent to

sum(x, x$x1)
#[1] 27

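The sum of every value in the data frame (6 + 15 = 21) is added to the sum of column x1 (6), giving 27. In the same way, x %>% sum(.$x2) is sum(x, x$x2) = 21 + 15 = 36.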
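
Here is a minimal sketch of how the placement of . changes what %>% hands to sum(). It reuses the x data frame from the question and assumes magrittr is installed. The names nested, braced, direct and piped are only illustrative.

```r
library(magrittr)

x <- data.frame(x1 = 1:3, x2 = 4:6)

# . appears only inside a nested call (.$x1), so x is still inserted
# as the first argument: this is sum(x, x$x1).
nested <- x %>% sum(.$x1)

# Braces turn off first-argument insertion: this is sum(x$x1).
braced <- x %>% {sum(.$x1)}

# . is a direct argument of sum(), so x is passed only once: this is sum(x).
direct <- x %>% sum(.)

# Extracting the column in its own step also avoids the surprise.
piped <- x %>% .$x1 %>% sum()

c(nested = nested, braced = braced, direct = direct, piped = piped)
```

This should give 27, 6, 21 and 6 respectively, so only the nested form counts the data frame twice.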


As far as the original example is concerned, we can verify the same behavior

library(data.table)

temp <- df[, .(Mean = mean(value)), by = .(id)]
sum(temp, temp$Mean)
#[1] 22.0588
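
The 22.0588 figure can be broken into its parts. The sketch below rebuilds temp from the question's data and assumes data.table is installed. The names whole, means and both are only illustrative. It checks the relationships rather than the exact numbers, because the values drawn by sample() after set.seed(1) may differ between R versions.

```r
library(data.table)

set.seed(1)
df <- setDT(data.frame(id = sample(1:5, 10, replace = TRUE), value = runif(10)))
temp <- df[, .(Mean = mean(value)), by = .(id)]

# Sum of every cell in the table: the id column plus the Mean column.
whole <- sum(temp)

# Sum of the Mean column only, which is what the question wanted.
means <- sum(temp$Mean)

# What temp %>% sum(.$Mean) actually computes: the table plus the column again.
both <- sum(temp, temp$Mean)

all.equal(whole, sum(temp$id) + means)
all.equal(both, whole + means)
```

In the question's output these correspond to 18.5294 (whole), 3.529399 (means) and 22.0588 (both), which is consistent with 18.5294 + 3.5294 = 22.0588.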
Ronak Shah
  • Thanks, `sum(x, x$x1)` makes sense now especially in the context of the function needing its first argument `f(x, ...)` specified. – Croote Mar 16 '20 at 21:26