10

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.

Let me propose a simple example:

require(data.table)

y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )

my objective is to subset data table dt by sampling one row for each letter a-h in column z.

Kerry
  • 411
  • 4
  • 13

4 Answers4

15

OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.

dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • please add relevant info as well. – SMR Nov 24 '15 at 07:06
  • 2
    this is perfect, just what I was looking for. Yes, assume multiple columns, my problem data table has over 200 columns, but I kept the example simple for clarity and comprehension (on my part). – Kerry Nov 24 '15 at 18:55
6

You can use dplyr

library(dplyr)

dt %>%
  group_by(z) %%
  sample_n(1)
thelatemail
  • 91,185
  • 12
  • 128
  • 188
bramtayl
  • 4,004
  • 2
  • 11
  • 18
0

I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:

# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")

Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun ' solution seems to give the same grouping:

set.seed(2017)
dt <- data.table(c1 = sample(52*10^6), 
                 c2 = sample(LETTERS, replace = TRUE), 
                 c3 = sample(10^5, replace = TRUE), 
                 c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user  system elapsed 
# 13.87    0.49   14.33 

# @akrun' solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user  system elapsed 
# 11.89    0.10   12.01 

# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)], 
          y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE

For sampling more than one row per group check here

Valentin_Ștefan
  • 6,130
  • 2
  • 45
  • 68
0

Updated workflow for dplyr. I added a second column v that can be grouped by z.

require(data.table)

y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)




library(dplyr)

dt %>%
  group_by(z) %>%
  slice_sample(n = 1)
canderson156
  • 1,045
  • 10
  • 24