
Update: it seems that with = F is incompatible with expressions in j, and also with (at least some) by = situations.

Taking the scenario below and simplifying it as much as possible:

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data = c(rep(T, 3), rep(F, 3)))

dt[
  ,
  3,
  with = F,
  by = list(group1, group2)
]

    data
1:  TRUE
2:  TRUE
3:  TRUE
4: FALSE
5: FALSE
6: FALSE
> 

dt[
  ,
  data,
  by = list(group1, group2)
]

   group1 group2  data
1:      a      x  TRUE
2:      a      x  TRUE
3:      a      y  TRUE
4:      b      y FALSE
5:      b      z FALSE
6:      b      z FALSE
>

The expression behavior is documented in a roundabout way in ?data.table:

A single column name, single expression of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are lists, too), or (when with=FALSE) a vector of names or positions to select.

I don't see anything in the documentation about with = F disabling by =, but in this case it seems to.


I'm having an issue where data.table either uses or ignores by = depending on whether I use with = F.

library(data.table)

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data = c(rep(T, 3), rep(F, 3)))

# without with = F

dt[
  as.vector(!is.na(dt[, 3, with = F])),
  sum(data),
  by = list(group1, group2)
]
>
   group1 group2 V1
1:      a      x  2
2:      a      y  1
3:      b      y  0
4:      b      z  0 

# with = F

dt[
  as.vector(!is.na(dt[, 3, with = F])),
  sum(3),
  with = F,
  by = list(group1, group2)
]
>
    data
1:  TRUE
2:  TRUE
3:  TRUE
4: FALSE
5: FALSE
6: FALSE

I've tried using a vector of numbers and a vector of characters for by =; neither works.

sum() is just an example function; I have the same basic issue when I don't use a function in j.

In the end, I need to use with = F to iterate across multiple columns of the data.table in a for loop.

Any suggestions?

Chris
  • I guess you are looking for this: `dt[!is.na(3), sum(data), by = .(group1, group2)]`. The part of `as.vector(!is.na(dt[, 3, with = F]))` is overcomplicating things imo. Instead, you can just use: `!is.na(3)` – Jaap Nov 11 '15 at 20:04
  • @Jaap, I think they want to use "3" instead of "data".... – A5C1D2H2I1M1N2O1R2T1 Nov 11 '15 at 20:06
  • That's correct, I need to iterate over this in a for loop. Really there are multiple columns of `data`. – Chris Nov 11 '15 at 20:08
  • @AnandaMahto Changed it, but basically it's the same. – Jaap Nov 11 '15 at 20:08
  • Could you explain what the expected output should be? – Jaap Nov 11 '15 at 20:09
  • I'm not sure how you expect "data.table" to determine whether you're asking for the sum of 3 or the sum of the third column when you use `sum(3)`. Perhaps you need to rethink your design. – A5C1D2H2I1M1N2O1R2T1 Nov 11 '15 at 20:10
  • The output from both examples should be the same, as far as I know. Anyway, ideally whether `with = F` is in place or not shouldn't change the effect of `by = `. – Chris Nov 11 '15 at 20:11
  • @Ananada Mahto - as far as I understand `with = F` accomplishes that. If you remove `with = F` and use `sum(3)` then you will get back `3` for each record. If you leave `with = F` in, then `data.table` reads `3` as the location of your column. – Chris Nov 11 '15 at 20:15
  • Why do you use `with = F` in your second example like that? I've never seen that before and therefore highly doubt whether this is valid *data.table* syntax. It gives me an error at least. – Jaap Nov 11 '15 at 20:16
  • @Chris It is better to use the column names than the column numbers – Jaap Nov 11 '15 at 20:17
  • Furthermore: make your example [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). You are talking about a for-loop, but I don't see one in your example code. – Jaap Nov 11 '15 at 20:20
  • @eddi Thanks - that makes sense. I updated above for posterity, but will post a more comprehensive question on how I can accomplish the bigger picture problem without `with = F`. – Chris Nov 11 '15 at 20:36
  • @Jaap It is valid syntax in data.table, but as eddi explained below it's for compatibility more than usefulness. I learned it in the context of normal R programming, so it's helpful to know that I should only be using it as a last option. – Chris Nov 11 '15 at 20:43
  • Have you gone through the [vignettes](https://github.com/Rdatatable/data.table/wiki/Getting-started)? – Arun Nov 11 '15 at 23:39
  • I have not yet Arun, thanks I will look at them. – Chris Nov 12 '15 at 02:37

1 Answer

A good rule of thumb for data with named columns: never use column numbers. Columns sometimes get rearranged, and that can leave your code completely broken. Of course, any rule of thumb has exceptions, but you'd need to demonstrate that your case warrants one, so I'll assume for now that it doesn't.

So, if you're typing the code you'd do:

dt[!is.na(data), sum(data), by = .(group1, group2)]

And if you have the column name instead in a variable, you'd do:

col = "data"
dt[!is.na(get(col)), sum(get(col)), by = .(group1, group2)]
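Since the question mentions looping over several columns, here is a minimal sketch of how the get(col) pattern might extend to a for loop. The second column `data2`, the `results` list, and the `total` output name are hypothetical, added only for illustration:

```r
library(data.table)

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data   = c(rep(TRUE, 3), rep(FALSE, 3)),
                 data2  = c(TRUE, NA, FALSE, TRUE, TRUE, NA))  # hypothetical extra column

cols <- c("data", "data2")   # column names to iterate over (assumed)
results <- list()            # one grouped summary per column

for (col in cols) {
  # Drop rows where the current column is NA, then sum it within each group
  results[[col]] <- dt[!is.na(get(col)),
                       .(total = sum(get(col))),
                       by = .(group1, group2)]
}
```

Each element of `results` is a separate data.table keyed by the column name, so `results[["data2"]]` holds the grouped sums for `data2`.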

As for using by together with with = FALSE: that mode is designed for compatibility with data.frame, which doesn't have a by argument. Even if by were supported there, the result would be trivial, since in with = FALSE mode the j-expression is always interpreted as a selection of full columns (just as in data.frame).
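If the goal is to apply the same grouped summary to several columns, one option that avoids with = FALSE altogether is .SD with .SDcols. This is only a sketch under assumptions: the `data2` column is hypothetical, and `na.rm = TRUE` stands in for the `!is.na()` filter in i, which is not strictly identical (a group whose values are all NA appears here with a sum of 0 rather than being dropped).

```r
library(data.table)

dt <- data.table(group1 = c("a", "a", "a", "b", "b", "b"),
                 group2 = c("x", "x", "y", "y", "z", "z"),
                 data   = c(rep(TRUE, 3), rep(FALSE, 3)),
                 data2  = c(TRUE, NA, FALSE, TRUE, TRUE, NA))  # hypothetical extra column

cols <- c("data", "data2")   # columns to summarise, referenced by name (assumed)

# .SD holds only the columns named in .SDcols; sum each one within every group
res <- dt[, lapply(.SD, sum, na.rm = TRUE),
          by = .(group1, group2),
          .SDcols = cols]
```

This returns a single table with one row per group and one summed column per entry in `cols`, so no explicit loop is needed.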

eddi
  • Great rule of thumb to learn eddi, thanks. And the context on `with = F` is very helpful in understanding what's going on behind the scenes. – Chris Nov 11 '15 at 20:42
  • @TheTime if you need to use matrix indexing, my guess would be that you chose the wrong data structure, but if you didn't - regular list indexing is compact enough for the rare occasion - e.g. `dt[[2]]` – eddi Nov 11 '15 at 20:44
  • @TheTime use column names - I don't see why you'd want to use column numbers for that – eddi Nov 11 '15 at 20:48