from data table, randomly select one row per group

Question

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.

Let me propose a simple example:

require(data.table)

y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )

my objective is to subset data table dt by sampling one row for each letter a-h in column z.

akrun · Accepted Answer · 2015-11-24T07:08:56.340

15

OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.

dt[dt[ , .I[sample(.N,1)] , by = z]$V1]

edited Nov 24 '15 at 07:08

answered Nov 24 '15 at 06:47

akrun

874,273
37
540
662

please add relevant info as well. – SMR Nov 24 '15 at 07:06
2

this is perfect, just what I was looking for. Yes, assume multiple columns, my problem data table has over 200 columns, but I kept the example simple for clarity and comprehension (on my part). – Kerry Nov 24 '15 at 18:55

score 6 · Answer 2 · edited Mar 14 '16 at 22:35

6

You can use dplyr

library(dplyr)

dt %>%
  group_by(z) %%
  sample_n(1)

edited Mar 14 '16 at 22:35

thelatemail

91,185
12
128
188

answered Nov 24 '15 at 07:42

bramtayl

4,004
2
11
18

4

Please provide some explanation of why your proposal solves the OP's problem. – APC Nov 24 '15 at 07:50
Seems to be a type here. Try `group_by(z) %>%` instead of `group_by(z) %%`. – cibr Apr 27 '18 at 12:58

Valentin_Ștefan · Answer 3 · 2017-09-10T22:00:41.063

I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:

# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")

Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun ' solution seems to give the same grouping:

set.seed(2017)
dt <- data.table(c1 = sample(52*10^6), 
                 c2 = sample(LETTERS, replace = TRUE), 
                 c3 = sample(10^5, replace = TRUE), 
                 c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user  system elapsed 
# 13.87    0.49   14.33 

# @akrun' solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user  system elapsed 
# 11.89    0.10   12.01 

# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)], 
          y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE

For sampling more than one row per group check here

score 0 · Answer 4 · answered Sep 28 '22 at 16:56

Updated workflow for dplyr. I added a second column v that can be grouped by z.

require(data.table)

y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)




library(dplyr)

dt %>%
  group_by(z) %>%
  slice_sample(n = 1)

from data table, randomly select one row per group

4 Answers4

Linked

Related