I have a data frame which could be approximated by the following example df:

a  <- seq(1, 1010, 1)
b  <- seq(2,1011,1)
c  <- c(rep(1,253), rep(2, 252), rep(3,254), rep(4,251))
d  <- c(rep(5,253), rep(6, 252), rep(7,254), rep(8,251))
df <- data.frame(a,b,c,d)

First I group my observations by columns c and d. Then I want an equal number of observations (n = 250) in each group. In other words, I want to drop the trailing rows of every group that has more than 250 rows.

This is pretty easy to do with if, but it takes a lot of time. Any help will be highly appreciated.

Arun
  • Does something like this `df[ df$a < 250, ]` work? – sckott May 11 '14 at 15:46
  • I do not think so, because that keeps the rows where column a is less than 250, but my question is about the number of observations in each group defined by columns c and d – user3618375 May 11 '14 at 15:54

2 Answers


An example using package plyr:

library(plyr)
ddply(df, .(c, d), function(DF) head(DF, 250))
Roland
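
For comparison, the same trimming can be sketched in base R with no extra packages. This is only an illustration, not part of either answer: the names `pos_in_group` and `trimmed` are made up here, and it assumes "last rows" means the rows that come last in the current row order of each group.

```r
# Rebuild the example data frame from the question
df <- data.frame(
  a = seq(1, 1010, 1),
  b = seq(2, 1011, 1),
  c = c(rep(1, 253), rep(2, 252), rep(3, 254), rep(4, 251)),
  d = c(rep(5, 253), rep(6, 252), rep(7, 254), rep(8, 251))
)

# Position of each row within its (c, d) group, in the original row order
pos_in_group <- ave(seq_len(nrow(df)), df$c, df$d, FUN = seq_along)

# Keep at most the first 250 rows of every group
trimmed <- df[pos_in_group <= 250, ]

# Rows per group after trimming
table(trimmed$c, trimmed$d)
```

Groups that already have 250 rows or fewer pass through unchanged, since the filter only drops rows whose within-group position exceeds 250.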

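Since speed seems to be an issue, you could use dplyr, which is faster than plyr:

library(dplyr)
# Number the rows within each (c, d) group, keep the first 250, and assign the result back
# (%.% is the chaining operator from early dplyr; later versions use %>%)
df <- df %.% group_by(c,d) %.% mutate(count = 1:n()) %.% filter(count <= 250)
df$count <- NULL  # drop the helper column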
talat
  • @user3618375 Happy to hear that :) By the way, I realized that you can probably write it this way: `df %.% group_by(c,d) %.% mutate(count = 1:n()) %.% filter(count <= 250) %.% select(-count)` so that you don't need the `df$count <- NULL` – talat May 11 '14 at 21:55
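
The `%.%` operator used above was deprecated in later dplyr releases in favour of `%>%`. A rough equivalent for a current dplyr installation might look like the sketch below. It is an adaptation rather than the answer's original code: it swaps the helper `count` column for `row_number()`, and the name `trimmed` is chosen here for illustration.

```r
library(dplyr)

# Rebuild the example data frame from the question
df <- data.frame(
  a = seq(1, 1010, 1),
  b = seq(2, 1011, 1),
  c = c(rep(1, 253), rep(2, 252), rep(3, 254), rep(4, 251)),
  d = c(rep(5, 253), rep(6, 252), rep(7, 254), rep(8, 251))
)

trimmed <- df %>%
  group_by(c, d) %>%              # one group per (c, d) combination
  filter(row_number() <= 250) %>% # keep the first 250 rows of each group
  ungroup()                       # return an ungrouped result

# Rows per group after trimming
count(trimmed, c, d)
```

Because `row_number()` is evaluated within each group, no temporary column has to be created and removed afterwards.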