32

I have a large data.table (about 24000 rows and growing). I want to subset that datatable based on a couple of criteria and from that subset (ends up being about 3000 rows) I want to randomly sample just 4 rows. I do not want to create a named 3000 or so row data.table, count its rows and then sample based on row number. How can I do it on the fly? Or should I just suck it up by creating the table and then working on it, sampling it and then using rm() to get rid of it?

Lets simulate my issue

require(data.table)
random.length  <-  sample(x = 15:30, size = 1)
data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE)) 

That makes a random length table, which simulates the fact that depending on my criteria and depending on my starting table, I do not know what the length of the subsetted table with be

Now, if I just wanted the first three rows I could do as so

data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[1:3]

But let us say I did not want the first three rows but rather a random 3 rows, then I would want to do something such as this...

data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[sample(x= 1:number of rows of that previous data.table,size = 3 ]

That will not work. How do I compute, on the fly, what the length of the initial data.frame was?

Farrel
  • 10,244
  • 19
  • 61
  • 99

3 Answers3

66

Have just made .N work in i. New README item :

.N is now available in i, FR#724. Thanks to newbie indirectly here and Farrel directly here.

This now works :

DT[...][...][sample(.N,3)]

e.g.

> random.length  <-  sample(x = 15:30, size = 1)
> data.table(city = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[sample(.N, 3)] 
         city score
1:   New York     4
2: Pittsburgh     3
3:  Cape Town     9
> 
Community
  • 1
  • 1
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • 2
    Darn, darn good. Thank you. That's just what I wanted. I realized I had to go install the latest version from GitHub. I ran into problems with `pdflatex is not available`. I read somewhere about the build_vignettes=F argument and all worked well after that. `install_github("data.table", "Rdatatable", build_vignettes=F)` – Farrel Jul 11 '14 at 04:56
  • 3
    why not just `DT[sample(.N,3)]`? – Ufos Mar 13 '19 at 16:50
  • 5
    @ufos the `[...][...]` part was just to convey the general case of when used in chaining, the `.N` refers to the last part of the chain (not the original `DT` at the beginning of the chain). That general case covers `DT[sample(.N,3)]` too. – Matt Dowle Mar 25 '19 at 18:05
  • @MattDowle how do you sample without replacement? – Herman Toothrot Nov 16 '21 at 17:50
  • @HermanToothrot `sample(.N,3)` is already sampling without replacement. To sample with replacement, `sample(.N,3,replace=TRUE)`. See `?sample`. – Matt Dowle Nov 17 '21 at 19:50
3

There is a two step approach:

  1. Compute the index i using .I
  2. Sample on index i

Example code.

require(data.table)
random.length  <-  sample(x = 15:30, size = 1)
data.table(city = sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"),size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[,i := .I][sample(i, 3)]
djhurio
  • 5,437
  • 4
  • 27
  • 48
1

Another alternative way would be to use sapply approach.
For example:

  as.data.table(sapply(DT[], sample, 10))
Daniel
  • 1,202
  • 2
  • 16
  • 25
  • nice, thanks. But note that this does not preserve the data type. I get every column transformed into numeric. – Sara Feb 13 '19 at 15:31