I'm moving some dplyr code to data.table and just ran into this problem. I want the row-to-row difference in b stored in column c when a is greater than or equal to 3. However, after running this code:

library(data.table)

df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3), 
                b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))

setDT(df)  # convert to a data.table by reference
df[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)]

all the elements in c are 0. Why is this?

df 
    a b c
 1: 1 0 0
 2: 1 1 0
 3: 1 0 0
 4: 1 1 0
 5: 2 0 0
 6: 2 1 0
 7: 2 1 0
 8: 3 0 0
 9: 3 3 0
10: 3 4 0
11: 3 5 0

What I thought was the equivalent dplyr:

library(dplyr)

df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3), 
                b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))

df %>% 
  group_by(a) %>% 
  mutate(c = ifelse(a >= 3, c(0, diff(b)), b))
M--
Rafael

2 Answers

From the help for ifelse(test, yes, no), it should return...

A vector of the same length and attributes (including dimensions and "class") as test and data values from the values of yes or no. The mode of the answer will be coerced from logical to accommodate first any values taken from yes and then any values taken from no.

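However, the length of test differs between the two packages: inside each data.table group the grouping column a has length 1, whereas dplyr passes the whole group vector: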

> df %>% group_by(a) %>% do(print(.$a))
[1] 1 1 1 1
[1] 2 2 2
[1] 3 3 3 3
> data.table(df)[, print(a), by=a]
[1] 1
[1] 2
[1] 3

As explained in the help pages, since the first argument has a length of one, if you pass vectors for the other parts, only their first element is used:

> ifelse(TRUE, 1:10, eleventy + million)
[1] 1
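
To make the length rule concrete, here is a minimal base R sketch. The names b, a_group and a_full are illustrative only; b holds the values from the a = 3 group of the question's data.

```r
# Values of b in the a == 3 group of the question's data
b <- c(0, 3, 4, 5)

a_group <- 3             # length 1, like the grouping column inside a data.table group
a_full  <- c(3, 3, 3, 3) # length 4, like the column dplyr passes for the same group

# The result has the length of test, so only the first element of yes is used
res_scalar <- ifelse(a_group >= 3, c(0, diff(b)), b)

# With a full-length test, every element of yes is used
res_vector <- ifelse(a_full >= 3, c(0, diff(b)), b)

print(length(res_scalar))
print(length(res_vector))
```

The first result has length 1 and the second has length 4, which is why the same ifelse call behaves differently in the two packages.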

You should probably use if ... else ... when working with a constant value, like...

> data.table(df)[, b := if (a >= 3) c(0, diff(b)) else b, by=a]

or even better, in this case you can assign to a subset:

> data.table(df)[a >= 3, b := c(0, diff(b)), by=a]

Regarding why a has length 1 for the data.table idiom, see its FAQ question "Inside each group, why are the group variables length-1?"
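
As a rough check of the points above, the following sketch inspects the lengths inside each group and then writes the result to a new column rather than overwriting b. The names dt, lens and c2 are made up for illustration, and the data is the question's original data.

```r
library(data.table)

dt <- data.table(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 b = c(0, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))

# Inside each group the grouping column has length 1; other columns have the group's length
lens <- dt[, .(len_a = length(a), len_b = length(b)), by = a]
print(lens)

# Option 1: if ... else on the length-1 grouping value, stored in c
dt[, c := if (a >= 3) c(0, diff(b)) else b, by = a]

# Option 2: copy b into c2, then overwrite only the rows where a >= 3
dt[, c2 := b]
dt[a >= 3, c2 := c(0, diff(b)), by = a]

print(dt)
```

Both options should give the same column; the second avoids evaluating a condition in every group.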

Frank

To illustrate this better, I am using a dataset in which the first value of b (in the group a = 1) is non-zero. In your dataset the first b of every group was zero, and c(0, diff(b)) also starts with zero, so the two cases were hard to tell apart.

What happens here is that, within each group, a has length 1, so the output of ifelse is a vector of length 1, which is then recycled to every row of that group.

library(data.table)

df = data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3), 
                b = c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5))

Running the original code on this data gives:

setDT(df)[ , c := ifelse(a >= 3, c(0, diff(b)), b), by = .(a)][]
#>     a  b  c
#>  1: 1 10 10
#>  2: 1  1 10
#>  3: 1  0 10
#>  4: 1  1 10
#>  5: 2  0  0
#>  6: 2  1  0
#>  7: 2  1  0
#>  8: 3  0  0
#>  9: 3  3  0
#> 10: 3  4  0
#> 11: 3  5  0

Now another example, using a simple vector of length 4 in place of c(0, diff(b)):

setDT(df)[ , c := ifelse(a >= 3L, c(20,2,3,4), -999), by=a][]
#>     a  b    c
#>  1: 1 10 -999
#>  2: 1  1 -999
#>  3: 1  0 -999
#>  4: 1  1 -999
#>  5: 2  0 -999
#>  6: 2  1 -999
#>  7: 2  1 -999
#>  8: 3  0   20
#>  9: 3  3   20
#> 10: 3  4   20 
#> 11: 3  5   20

As before, ifelse returns only the first element of the yes vector (20 here), and that single value is assigned to every row of c in that group of a.

A workaround is to drop by and use diff on a to detect where it does not change (i.e. diff(a) == 0), using that as a pseudo-grouping together with the other condition, like below:

setDT(df)[, c := ifelse(a >= 3 & c(FALSE, diff(a) == 0), c(0, diff(b)), b)][]
#>     a  b  c
#>  1: 1 10 10
#>  2: 1  1  1
#>  3: 1  0  0
#>  4: 1  1  1
#>  5: 2  0  0
#>  6: 2  1  1
#>  7: 2  1  1
#>  8: 3  0  0
#>  9: 3  3  3
#> 10: 3  4  1
#> 11: 3  5  1
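
The same logic can be traced on plain vectors. This is only a sketch, with same_group and res as illustrative names, and it assumes the rows are already ordered so that equal values of a are contiguous, since diff(a) == 0 only compares neighbouring rows.

```r
a <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
b <- c(10, 1, 0, 1, 0, 1, 1, 0, 3, 4, 5)

# TRUE where a row belongs to the same group as the row before it
same_group <- c(FALSE, diff(a) == 0)

# test now has full length, so ifelse picks element by element;
# the first row of each group falls back to b
res <- ifelse(a >= 3 & same_group, c(0, diff(b)), b)
print(res)
```

If the data is not sorted by a, the pseudo-grouping would miss group boundaries, so a grouped if ... else may be the safer choice there.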
M--
  • Hm, would upvote for the explanation and workarounds but the guess "If you use a vector [for assignment] instead, data.table will only use the first element." happens to be wrong – Frank Jun 12 '19 at 20:14
  • @Frank `setDT(df)[ , c := ifelse(a >= 3L, c(20,2,3,4), -999), by=a][]` What about this then? p.s. don't worry about the upvote, here to learn, not to achieve :D – M-- Jun 12 '19 at 20:15
  • The value you are assigning (on the right hand side of :=) is a length-one vector `setDT(df)[ , print(ifelse(a >= 3L, c(20,2,3,4), -999)), by=a][]`, right? – Frank Jun 12 '19 at 20:16
  • If I understand the parenthetical in your first sentence correctly then no, you’re wrong: dplyr *can* summarise each group into a single value, or it can create a vector of values for each group. – Konrad Rudolph Jun 12 '19 at 20:22
  • @KonradRudolph that was not the (main) issue though. You can assign a single value or a vector in dplyr. That's not what the OP was concerned about, and my sentence technically did not exclude the possibility of using a single value with dplyr. It's mostly about how data.table handles this. However, thanks for pointing that out. – M-- Jun 12 '19 at 20:37