-1

I want to keep a sample of rows that contain a certain value, with a limit of 3 rows per value.

For example, say I want to keep a maximum of 3 rows per colour:

    X1         X2
1   0.7091409  RED
2  -1.1334614  BLUE
3   2.3343391  RED
4  -0.9040278  GREEN
5   0.4180331  RED
6   0.7572246  RED
7  -0.8996483  BLUE
8  -1.0356774  BLUE
9  -0.3983045  GREEN
10 -0.9060305  BLUE

Here, in column X2, RED appears 4 times, BLUE appears 4 times, and GREEN appears 2 times. I want to trim the rows to keep a maximum of 3 rows that includes a specific value in column X2. So the above dataset would become:

    X1         X2
1   0.7091409  RED
2  -1.1334614  BLUE
3   2.3343391  RED
4  -0.9040278  GREEN
5   0.4180331  RED
6  -0.8996483  BLUE
7  -1.0356774  BLUE
8  -0.3983045  GREEN

Any ideas on how to achieve this?

jds7117
  • 11
  • 2
  • [see here](http://stackoverflow.com/questions/34753050/data-table-select-first-n-rows-within-group) for a solution with *data.table* – Jaap Mar 22 '16 at 21:11

2 Answers2

0

Here is a base R solution with split to split into groups based on X2, mapply to apply sample to the different groups, and pmin to get the sample sizes for each group.

d2 <- split(d$X1, d$X2)
stack(mapply(sample, d2, pmin(lengths(d2), 3)))

An alternate solution could be to use stratified from the splitstackshape package:

library(splitstackshape)
stratified(d, "X2", size = pmin(lengths(split(d$X1, d$X2)), 3))

This solution works when there are more than the two columns to include in the sample.


Edit:

Handle more than two columns using the base R answer above. If all the values in X1 are unique, you can simply merge (left join) the sampled data with the original data to fill out the:

sampled_d <- stack(mapply(sample, d2, pmin(lengths(d2), 3)))  # same as base solution above
merge(sampled_d, d, 
  by.x = c("values", "ind"),
  by.y = c("X1", "X2"),
  all.x = TRUE)  # left join

Otherwise, the base solution could be adapted to the following ugly code:

d3 <- split(d, d$X2)
do.call(rbind, 
  mapply(function(ii, jj) ii[sample(nrow(ii), jj), ],
         d3, 
         pmin(lapply(d3, function(ii) dim(ii)[1]), 3),
         SIMPLIFY = FALSE))

Example data:

d <- read.table(text="
    X1         X2
1   0.7091409  RED
2  -1.1334614  BLUE
3   2.3343391  RED
4  -0.9040278  GREEN
5   0.4180331  RED
6   0.7572246  RED
7  -0.8996483  BLUE
8  -1.0356774  BLUE
9  -0.3983045  GREEN
10 -0.9060305  BLUE", header=TRUE)
Community
  • 1
  • 1
Jota
  • 17,281
  • 7
  • 63
  • 93
  • Thanks @Jota. However, the original data I'm working with has 16 columns. Can this method be implemented for more than 2 columns? It's not clear to me how. – jds7117 Mar 23 '16 at 01:50
0

I believe this solution worked for me.

# install.package(data.table) # if necessary
library(data.table)
# convert my dataframe (df2) to a datatable
DT <- data.table(df2) 

Then, perform the function that keeps a maximum of 3 of rows for a group defined by a value in column 'X2'.

DT2 <- DT[, head(.SD, 3), by = "X2"]
Community
  • 1
  • 1
jds7117
  • 11
  • 2
  • If random sampling is important, just a note that `head` is taking the first 3 or fewer observations for each group, not a random sample. If random sampling is important, you'll have to adapt the solution to something like: `DT[, lapply(.SD, sample, pmin(dim(.SD)[1], 3) ), by = "X2"]` – Jota Mar 23 '16 at 01:53