3

I want to get both the minimum and a random sample of a variable by group, in data.table.

data.table(ggplot2::movies)[, list(min=min(rating), random=sample(rating, 1)), by=list(year, Action)]

does not work:

Error in `[.data.table`(data.table(movies), , list(min(rating), sample(rating,  : 
Column 2 of result for group 88 is type 'integer' but expecting type 'double'. Column types must be consistent for each group.

If I coerce it to numeric, I get this astonishing result: groups whose random rating is below (?!!) the minimum of the same group.

data.table(ggplot2::movies)[, list(min=min(rating), random=as.numeric(sample(rating, 1))), by=list(year, Action)][random<min]
   year Action min random
1: 1916      1 6.2      6
2: 1911      1 5.7      1
3: 1901      1 4.2      3
4: 1914      1 6.1      6
5: 1923      1 8.2      4
6: 1918      1 5.9      5
7: 1921      1 7.5      4

Using .SD does not change anything:

data.table(ggplot2::movies)[, list(min=min(rating), random=as.numeric(sample(.SD$rating, 1))), by=list(year, Action)][random<min]
   year Action min random
1: 1916      1 6.2      2
2: 1911      1 5.7      4
3: 1893      0 7.0      2
4: 1901      1 4.2      4
5: 1914      1 6.1      5
6: 1923      1 8.2      8
7: 1918      1 5.9      4

And worse, no error arises when the variable is an integer:

data.table(ggplot2::movies)[, list(min=min(votes), random=sample(votes, 1)), by=list(year, Action)][random<min]
   year Action min random
1: 1916      1 135     43
2: 1911      1  26      2
3: 1893      0  90     52
4: 1901      1  13     12
5: 1923      1 757    368
6: 1918      1  60     49
7: 1921      1  73     48

Apparently the sample function does not want to work on the subset...

Help!

Arthur
  • Please provide example data. http://stackoverflow.com/a/28481250/1191259 Also, do you want to mention the package you're getting `%>%` from? magrittr, perhaps? – Frank Oct 09 '15 at 14:31
  • Fwiw, `data.table(iris)[, .(min = min(Sepal.Length), rand = sample(Sepal.Length,1)), by=.(Species)]` I'm curious if you found a bug, but we have no way to confirm it without an example. – Frank Oct 09 '15 at 14:37
  • I can remove the `%>%`s and add `ggplot2::` in front of `movies`. I have edited my question. – Arthur Oct 09 '15 at 14:59
  • Thanks. I am also puzzled by this. Hopefully someone else will have some insight. – Frank Oct 09 '15 at 15:06

2 Answers

2

You fell into the standard sample trap. From ?sample:

If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x).

Use e.g. the resample suggestion from ?sample.
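
To make the trap concrete, here is a small base R sketch. The values are made up for illustration and are not taken from movies. When a group holds a single number, sample() stops picking from the elements and, per the documentation quoted above, draws whole numbers from 1:x instead. For a double such as 6.2 the result is an integer, which would explain the type error in the question; for an integer such as 135 there is no error, only draws that are usually below the group minimum. The resample helper is the one given in the examples of ?sample.

```r
# Several ratings in a group: sample() picks one of the elements, as expected.
ratings <- c(6.2, 7.5, 8.1)
sample(ratings, 1) %in% ratings       # TRUE

# A single double in a group: sample() no longer returns the element itself.
single_rating <- 6.2
rating_draws <- replicate(200, sample(single_rating, 1))
is.integer(rating_draws)              # TRUE: whole numbers, hence the type error
any(rating_draws == single_rating)    # FALSE: 6.2 is never returned

# A single integer in a group: no type error, but the draw comes from 1:135.
single_votes <- 135L
vote_draws <- replicate(200, sample(single_votes, 1))
all(vote_draws %in% 1:135)            # TRUE

# Helper from the examples in ?sample: sample a position, then subset.
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(single_rating, 1)            # 6.2
resample(single_votes, 1)             # 135
```

Inside a grouped data.table call, replacing sample(rating, 1) with resample(rating, 1) should therefore behave the same for groups of any size.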

eddi
  • @Arthur You don't need to load `gdata` for resample. As mentioned by eddi, it is in the examples section of the doc for `?sample`, defined as `resample <- function(x, ...) x[sample.int(length(x), ...)]` – Frank Oct 09 '15 at 17:47
0

I finally found a workaround, but it does not explain why sample() does not work as expected on the subset.

data.table(movies)[, list(min=min(votes), random=votes[sample(1:.N, 1)]), by=list(year, Action)]
     year Action min random
  1: 1971      0   5     77
  2: 1939      0   5     13
  3: 1941      0   5      7
  4: 1996      0   5   4066
  5: 1975      0   5      6
 ---                       
201: 1931      1   8      8
202: 1928      1  17     41
203: 1923      1 757    757
204: 1918      1  60     60
205: 1921      1  73     73

With this, the strange behaviour described earlier disappears:

data.table(movies)[, list(min=min(votes), random=votes[sample(1:.N, 1)]), by=list(year, Action)][random<min]
Empty data.table (0 rows) of 4 cols: year,Action,min,random
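
As a hedged illustration of why indexing by position helps, here is a minimal sketch on toy data (the table and its values are assumptions made up for this example, not the movies dataset). Group "b" has a single row, which is the case where sampling the values directly goes wrong. Sampling a row position with sample.int(.N, 1), which plays the same role as sample(1:.N, 1) above, can only return 1 for such a group, so the value itself is returned.

```r
library(data.table)

# Toy data (hypothetical values): group "b" has exactly one row.
dt <- data.table(
  grp    = c("a", "a", "a", "b"),
  rating = c(6.2, 7.5, 8.1, 5.7)
)

# Sample a row position within each group, then index the column with it.
res <- dt[, .(min = min(rating), random = rating[sample.int(.N, 1)]), by = grp]

res[random < min]          # no rows: a sampled value is never below the minimum
res[grp == "b", random]    # 5.7, the only value in that group
```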
Arthur