R remove rows from dataframe after occurrences of a value reach a limit

Question

I have an R dataframe sorted by the first value.

There are many different rows with each first value.

I want to keep the first 200 rows with each first value, and remove all the others.

So for example if I start with 300 "1 whatever..." rows and 400 "2 whatever..." rows, what I want is 400 rows: the first 200 "1" rows, then the first 200 "2" rows.

Thanks in advance...

Please make this question *reproducible*. This includes sample code (including listing non-base R packages), sample data (e.g., `dput(head(x))`), and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. — r2evans, Nov 11 '18 at 04:43
Though if I had to guess, it could be something like `do.call(rbind.data.frame, by(mtcars, mtcars$cyl, head, n=3))`. — r2evans, Nov 11 '18 at 04:43

morgan121 · Answer 1 · 2018-11-11T10:17:25.043


Please make questions reproducible in the future, and include information about the steps you have already tried. Example data also helps us answer you more quickly.

Here is a little example I made up using the dplyr package:

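library(dplyr)    # group_by(), filter(), row_number()
library(magrittr) # %>% pipe operator (dplyr also re-exports it)

# Example data: 300 rows with X = 1 followed by 300 rows with X = 2
data <- data.frame(X=c(rep(1,300),rep(2,300)), Y=1:600)

# Keep the first 200 rows of each X group, in their existing order.
# (top_n(200) would instead rank by the last column and keep ties.)
subdata <- data %>%
    group_by(X) %>%
    filter(row_number() <= 200) %>%
    ungroup()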

This will end with 400 rows, 200 '1' rows and 200 '2' rows. Let me know if you have any issues.
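
For readers who would rather not load dplyr, the same "first 200 rows per group" idea can be sketched in base R. This is only an illustration on the example data above; the names `pos_in_group` and `subdata_base` are made up for this sketch.

```r
# Same shape as the example data in the answer above
data <- data.frame(X = c(rep(1, 300), rep(2, 300)), Y = 1:600)

# Position of each row within its X group: 1, 2, 3, ... restarting per group
pos_in_group <- ave(seq_len(nrow(data)), data$X, FUN = seq_along)

# Keep only the rows whose within-group position is at most 200
subdata_base <- data[pos_in_group <= 200, ]

# Row counts per value of X
table(subdata_base$X)
```

Because rows are chosen by position rather than by value, groups with fewer than 200 rows are kept whole and the original row order is preserved.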

edited Nov 11 '18 at 10:17

answered Nov 11 '18 at 10:09


Thanks; this worked, partly. I did everything you said, working with my dataframe; but when I exported subdata and looked at it, there were 247 lines with the first value of userID (the first column in my dataframe), then 222 lines with the next value of userID, then 215 lines with the next value, then 235 with the next, etc. So this is pruning the number of rows for each userID, but not uniformly. I haven't used dplyr before, and I don't know why. – Phil Rennert Nov 12 '18 at 17:32
That's weird. Can you show me your data using the `dput(data)` command? There could be some weirdness going on if some columns are factors, but I can do some testing to see if I can reproduce your issue. – morgan121 Nov 12 '18 at 20:23
Okay, thanks. I output the file with dput. It's 219K, 3400 lines (the original is 12,600 lines). I can't just attach it, can I? The first lines look like: structure(list(userID = c(78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, 78L, and so on for all the lines for user 78, then the next one... Can I get it to you? – Phil Rennert Nov 14 '18 at 22:16
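
Uneven group sizes like the 247, 222, 215 and 235 rows reported above are what dplyr's `top_n()` tends to produce when the ranking column contains ties: it ranks rows by a weight column (the last column of the data frame when none is given) and keeps every row tied at the cutoff. Whether that is what happened with the real data cannot be confirmed from the thread, so the following is only a sketch on made-up data; the `score` column and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two users, with many tied values in the last column
toy <- data.frame(
  userID = rep(c(78L, 79L), each = 6),
  score  = c(1, 2, 2, 2, 2, 3,   1, 1, 1, 1, 1, 1)
)

# top_n() ranks by score and keeps ties, so a group can exceed n rows
tied <- toy %>%
  group_by(userID) %>%
  top_n(2, score) %>%
  ungroup()

# Selecting by position keeps exactly the first n rows of each group
kept <- toy %>%
  group_by(userID) %>%
  filter(row_number() <= 2) %>%
  ungroup()

table(tied$userID)  # per-user counts when ties are kept
table(kept$userID)  # per-user counts when selecting by position
```

Selecting by `row_number()` depends only on the current row order, which matches the question's requirement that the data frame is already sorted and the first 200 rows per value should survive.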
