
I'm working with a big data.table and using 'by' to 'group by' 3 variables.

My data.table is d and has the key "ma" (a 10-digit integer, but I have shortened it below).

But setting by="ma,year,month" (which to me is the more intuitive group-by order) does not give what I want: some groups appear duplicated. For example, ma = 284 has 3 separate rows for Nov 2011, and ma = 672 has 2 rows for Dec 2011.

> d[,list(n=length(trx_num)),by=list(ma,year,month)]
      ma year month n
  1: 284 2011    12 3
  2: 284 2012     1 1
  3: 284 2011    11 5
  4: 284 2011    11 1
  5: 284 2011    11 2
 ---
5782971: 672 2012     7 1
5782972: 672 2011    12 1
5782973: 672 2012     2 1
5782974: 672 2011    12 1
5782975: 672 2012     1 1

Reversing the 'by' order, however, gives the desired result.

> d[,list(n=length(trx_num)),by=list(month,year,ma)]
     month year ma  n
  1:    12 2011 284  3
  2:     1 2012 284  1
  3:    11 2011 284  8
  4:     5 2012 543  7
  5:     7 2012 543  3
 ---
1214686:     5 2012 672 28
1214687:     4 2012 672 13
1214688:    12 2011 672 11
1214689:     7 2012 672  9
1214690:     9 2012 672 11

What am I missing here? Thanks in advance.

EDIT:

str() of the data that gives the wrong result:

> str(d)
Classes 'data.table' and 'data.frame':  14688135 obs. of  3 variables:
 $ ma   : num  3e+10 3e+10 3e+10 3e+10 3e+10 ...
 $ year : int  2011 2012 2011 2011 2011 2011 2011 2011 2011 2011 ...
 $ month: int  12 1 11 12 11 11 11 11 11 11 ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "ma"

str() of the wrong result:

> str(d[,.N,by=list(ma,year,month)])
Classes 'data.table' and 'data.frame':  5782975 obs. of  4 variables:
 $ ma   : num  3e+10 3e+10 3e+10 3e+10 3e+10 ...
 $ year : int  2011 2012 2011 2011 2011 2012 2012 2012 2012 2012 ...
 $ month: int  12 1 11 11 11 5 7 6 9 8 ...
 $ N    : int  3 1 5 1 2 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

And str() of right result:

> str(d[,.N,by=list(month,year,ma)])
Classes 'data.table' and 'data.frame':  1214690 obs. of  4 variables:
 $ month: int  12 1 11 5 7 6 9 8 11 12 ...
 $ year : int  2011 2012 2011 2012 2012 2012 2012 2012 2011 2011 ...
 $ ma   : num  3e+10 3e+10 3e+10 3e+10 3e+10 ...
 $ N    : int  3 1 8 7 3 12 15 3 6 6 ...
 - attr(*, ".internal.selfref")=<externalptr>
hhh
    I think you are missing a smaller example that illustrates this. – IRTFM Jan 10 '13 at 20:27
  • Agree with [DWin](http://stackoverflow.com/users/1855677/). Without code that I can interact with in R, it is hard to diagnose what's wrong. Can you provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Blue Magister Jan 10 '13 at 20:34
  • 1
    Also, if you want to know the number of rows in each group, you can use the idiom `.N`, like `d[,.N,by=list(month,year,ma)]`. – Blue Magister Jan 10 '13 at 20:35
  • Thanks for the `.N`, didn't know about that one. It does not seem to change anything on my full dataset though - I will try and construct a reproducible example. – hhh Jan 10 '13 at 21:09

2 Answers


To wrap up following the comment trail: the ma column was type numeric and contained values which were distinct but very close together, some pairs within machine tolerance and some just outside it. In other words, this situation:

 x < y < z
 (y-x) just less than machine tolerance so considered equal
 (z-y) just less than machine tolerance so considered equal
 (z-x) just over machine tolerance so considered not equal

When such a column is grouped alongside two other columns (i.e. by= 3 columns), the order of those 3 columns, if one of those columns has values like above, can change whether those values are considered equal (and in the same group) or not.
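This can be sketched with `all.equal()`, whose default tolerance (`sqrt(.Machine$double.eps)`, about 1.5e-8) plays the role of machine tolerance here; data.table's internal comparison differs in detail, and the values below are made up purely for illustration:

```r
## Hypothetical values illustrating the situation above: each neighbouring
## pair is within tolerance, but the outer pair is not.
tol <- .Machine$double.eps^0.5   # default tolerance of all.equal(), ~1.5e-8
x <- 1
y <- x + 0.9 * tol
z <- y + 0.9 * tol

isTRUE(all.equal(x, y))   # TRUE  -- considered equal
isTRUE(all.equal(y, z))   # TRUE  -- considered equal
isTRUE(all.equal(x, z))   # FALSE -- 1.8 * tol apart, considered not equal
```

So depending on which value a group's representative happens to be compared against first (which changes with column order), y can be lumped in with x, with z, or treated as its own group.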

The solution is not to use type numeric (i.e. double) for such data. Use integer, or, in this case where the integers were larger than 2^31 (giving rise to the coercion to double and loss of accuracy, if I understand correctly), use character instead. data.table is fast at sorting integer and character; it's not as fast at sorting double yet, anyway.
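A minimal sketch of the coercion problem (the 3e10-style id is made up to match the str() output above): values above `.Machine$integer.max` (2^31 - 1) cannot be stored as R integers, so `as.integer()` returns NA with a warning, while character keeps the digits exactly:

```r
ma <- 3e10 + 1                     # 11-digit id, necessarily stored as double

.Machine$integer.max               # 2147483647, i.e. 2^31 - 1
suppressWarnings(as.integer(ma))   # NA -- out of integer range
as.character(ma)                   # "30000000001" -- exact, safe to group on
```

Note that doubles represent whole numbers exactly up to 2^53, so converting to character only helps if precision was not already lost upstream; as discussed in the comments below, the data should be read in as character in the first place.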

We'll try and add a new warning to data.table:

FR#2469 Add new tolerance.warning option to detect and issue warning if any numeric values are close but not quite within machine tolerance

Matt Dowle

I built a small test case that, at one point in this dialog, I thought exhibited the unexpected behavior (but I was reading the wrong objects for comparison):

d <-     structure(list(ma = c(284L, 284L, 284L, 284L, 284L, 284L, 284L, 
284L, 284L, 284L, 284L, 284L, 672L, 672L, 672L, 672L, 672L), 
    year = c(2011L, 2011L, 2011L, 2012L, 2011L, 2011L, 2011L, 
    2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2011L, 2012L, 2011L, 
    2012L), month = c(12L, 12L, 12L, 1L, 11L, 11L, 11L, 11L, 
    11L, 11L, 11L, 11L, 7L, 12L, 2L, 12L, 1L), trx_num = c(4L, 
    9L, 8L, 4L, 4L, 6L, 3L, 8L, 2L, 2L, 8L, 9L, 8L, 6L, 10L, 
    6L, 10L)), .Names = c("ma", "year", "month", "trx_num"), row.names = c(NA, 
-17L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x105afb0>, sorted = "ma")

To build it:

dat <- read.table(text=" ma year month n
 284 2011    12 3
 284 2012     1 1
 284 2011    11 5
 284 2011    11 1
 284 2011    11 2
 672 2012     7 1
 672 2011    12 1
 672 2012     2 1
 672 2011    12 1
 672 2012     1 1", header=TRUE)
require(data.table)
d <- data.table( data.frame(dat[rep(rownames(dat), times=dat$n), 1:3], trx_num=unlist(sapply(dat$n, sample, x=1:10)) ) )
setkey(d, ma)
d[,list(n=length(trx_num)),by=list(ma,year,month)]
d[,list(n=length(trx_num)),by=list(month,year,ma)]

At which point it becomes clear that BlueMagister's solution is correct:

d[,.N, by=list(month,year,ma)]
d[,.N, by=list(ma,year,month)] # same result modulo row order
IRTFM
  • Thanks DWin. So I don't get the weird result on your example data. But I still get it (with or without .N on my full data). Will try and replicate on a smaller dataset and post that. – hhh Jan 10 '13 at 21:12
  • OK. Even weirder then, because I did get the weird result with that object. Probably will now need to start describing machine setups. I use R 2.15.2 on a Mac 10.6.8 running data.table ver 1.8.2 – IRTFM Jan 10 '13 at 21:16
  • R 2.15.1 on Linux and data.table 1.8.6. So I just simply tried making a new data.table with only the two ma numbers I printed in the qn and no weird behaviour (both `length()` and `.N` work for both orderings in `by`). With the full data, though, the problem is still there. Not sure. – hhh Jan 10 '13 at 21:43
  • I am no longer able to document it either when I moved to my other machine that has data.table ver 1.8.6. – IRTFM Jan 10 '13 at 23:17
  • +1 I ran that with 1.8.2 and couldn't reproduce. DWin, you really saw duplicated groups in the result with your small example? You got me worried! I don't recall any bug fixes in this area. All I can think is that maybe one or more grouping columns are `numeric` and the numbers are close but not within machine tolerance and they just appear to be the same when printed. @user1853769, could you provide the result of `str(yourrealdata)` and `str(thebadresult)` so we can see the types please. – Matt Dowle Jan 11 '13 at 00:27
  • I've still got the console session open where it happened to me. Can provide the transcript. I noticed that the dput(.) on a keyed data.table object included a self-reference that I think gets in the way of just pasting it into a fresh session reproducibly. – IRTFM Jan 11 '13 at 00:30
  • Yep, the self ref bit needs to be excluded when pasting it, but that's ok. Transcript great! Thanks. – Matt Dowle Jan 11 '13 at 00:34
  • @MatthewDowle thanks for looking into this - have edited above to show column types. – hhh Jan 11 '13 at 15:11
  • @user1853769 Great, thanks. Likely it's the numeric `ma` column. What does `unique(d$ma)` return? And what does `print(d$ma,digits = 20)` return. If you run `d[,ma:=as.integer(ma)]` and rerun is it all fixed? I suspect that somehow you've ended up with a series of precisely different numeric values but very close to each other around machine precision. Such that in different orders they may be considered equal to the previous one, or not. – Matt Dowle Jan 11 '13 at 16:01
  • @MatthewDowle: I've reviewed the transcript and found that I was not reporting the results accurately. Will correct the record. – IRTFM Jan 11 '13 at 16:23
  • @MatthewDowle - I think it has to be something with the numeric versus int, but `d[,ma:=as.integer(ma)]` makes the column all NA's. Note my ma column has 11-digit numbers, not 10 as I mentioned in the original qn. A simple `as.integer()` on 10 digits seems ok, but not 11 - it spits a warning? – hhh Jan 11 '13 at 16:49
  • @DWin No worries, thanks: relief! Will wait for confirm from user1853769 and if that `ma` column is the problem maybe `data.table` grouping should warn when it detects `numeric` values close together but not quite within machine tolerance (a new warning.tolerance option or something). – Matt Dowle Jan 11 '13 at 16:53
  • @user1853769 Ahah, so the values in `ma` are larger than 2^31 hence R has coerced to `double`. Hm. `bit64::integer64` is working in `fread` but I don't think `integer64` can be grouped yet (haven't tried). Wherever the large integers are coming from, convert/read them as `character` for now, before R converts them to real. If they're in a file use `colClasses="character"` for that column for example, to avoid losing accuracy. `data.table` is faster at sorting `character` than real, anyway, since `character` are hashed by R and data.table sneakily hooks into that hash. – Matt Dowle Jan 11 '13 at 16:58
  • @MatthewDowle - Confirmed! `d[,ma:=as.character(ma)]` does the trick. Both orderings of 'by' now give the same result. Great to work this out - do you want to add a quick answer so I can mark correct. Thanks heaps. – hhh Jan 11 '13 at 17:05
  • @user1853769 Btw, for completeness, although the `ma:=as.character(ma)` now gives the same result when grouping in each order it might not be the _right_ result. The coercion to `double` may have lost precision and converting to `character` won't get that back. The original data needs to be retained as `character` in the first place to retain that precision. Probably obvious but just in case. – Matt Dowle Jan 11 '13 at 18:46
  • @MatthewDowle - thanks for the heads up. I'm reading the data from sql, so `select cast(ma as char)` in the query seems to do the trick. – hhh Jan 12 '13 at 17:05