How can I subset a dataframe by nrow and groups in R?

Question

I have a dataframe that contains 240,000 obs. of 7 variables. In the dataframe there are 100 groups of 2400 records each, by Symbol. Example:

Complete DataFrame

I want to build a new data frame from this one that contains the first observation and every 240th observation of each group.
The new data frame will have 1000 obs. of 7 variables:

New DataFrame

I tried df[seq(1, nrow(df), 240), ], but that takes every 240th row of the whole data frame and does not distinguish by group (Symbol). What I want is a new data frame that contains rows 240, 480, 720, 960, and so on, for each symbol. In the original data frame every symbol has 2400 obs., so the new data frame will have 10 obs. per group.

You simply want to generate `c(1,240)` inside each group of 2400? Can we assume that all rows for `Symbol=='AAA'` come first, followed by `AAB` etc.? Then we only need to generate row numbers, not groupby. — smci, Sep 05 '18 at 21:49
Possible duplicate of [Select first and last row from grouped data](https://stackoverflow.com/questions/31528981/select-first-and-last-row-from-grouped-data) — tjebo, Sep 05 '18 at 21:58
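
The row-number idea from the first comment can be sketched in base R with `ave()`, which numbers the rows inside each group without requiring that the groups be stored contiguously. The toy data below (three symbols, a `Value` column) is made up for illustration; only the group size of 2400 and the step of 240 come from the question.

```r
# Hypothetical toy data: 3 symbols with 2400 rows each
df <- data.frame(
  Symbol = rep(c("AAA", "AAB", "AAC"), each = 2400),
  Value  = seq_len(3 * 2400)
)

# Position of each row inside its own Symbol group (1, 2, ..., 2400)
pos <- ave(seq_len(nrow(df)), df$Symbol, FUN = seq_along)

# Keep rows 240, 480, ..., 2400 of every group
res <- df[pos %% 240 == 0, ]

nrow(res)            # 30 rows: 10 per symbol
table(res$Symbol)
```

To also keep the first row of each group, the condition could be widened to `pos == 1 | pos %% 240 == 0`.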

Jilber Urbina · Accepted Answer · 2018-09-05T23:27:26.660

Since we don't have your data, we can use a built-in R dataset: iris. In this example we split iris by Species and select the first n rows of each group using head; here n = 5, which extracts the first 5 rows per Species.

> split_data <- lapply(split(iris, iris$Species), head, n=5)
> do.call(rbind, split_data)
              Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
setosa.1               5.1         3.5          1.4         0.2     setosa
setosa.2               4.9         3.0          1.4         0.2     setosa
setosa.3               4.7         3.2          1.3         0.2     setosa
setosa.4               4.6         3.1          1.5         0.2     setosa
setosa.5               5.0         3.6          1.4         0.2     setosa
versicolor.51          7.0         3.2          4.7         1.4 versicolor
versicolor.52          6.4         3.2          4.5         1.5 versicolor
versicolor.53          6.9         3.1          4.9         1.5 versicolor
versicolor.54          5.5         2.3          4.0         1.3 versicolor
versicolor.55          6.5         2.8          4.6         1.5 versicolor
virginica.101          6.3         3.3          6.0         2.5  virginica
virginica.102          5.8         2.7          5.1         1.9  virginica
virginica.103          7.1         3.0          5.9         2.1  virginica
virginica.104          6.3         2.9          5.6         1.8  virginica
virginica.105          6.5         3.0          5.8         2.2  virginica

Update

Given your comment, try this using your data.frame:

ind <- seq(from=240, to=2400, by=240) # a row index within each group, length = 10
split_data <- lapply(split(yourData, yourData$Symbol), function(x) x[ind,] )
do.call(rbind, split_data)
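
A quick way to check this kind of per-group indexing is to run it on simulated data shaped like the data in the question: 100 symbols with 2400 rows each. The symbol names and the `Price` column are invented for this sketch, and the index assumes every group has exactly 2400 rows, so it only runs from 240 to 2400.

```r
# Simulated stand-in for the real data (names and values are made up)
set.seed(1)
symbols <- sprintf("S%03d", 1:100)
yourData <- data.frame(
  Symbol = rep(symbols, each = 2400),
  Price  = rnorm(100 * 2400)
)

# Row index inside one group: 240, 480, ..., 2400
ind <- seq(from = 240, to = 2400, by = 240)

# Split by Symbol, pick the indexed rows of each piece, and recombine
split_data <- lapply(split(yourData, yourData$Symbol), function(x) x[ind, ])
res <- do.call(rbind, split_data)

dim(res)              # 1000 rows, 2 columns
anyNA(res$Price)      # FALSE: no index runs past the end of a group
```

An index that runs past the number of rows in a group would produce rows filled with NA, which is why the upper bound is the group size and not the size of the whole data frame.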

Hi, thanks for the answer. I didn't explain well. I want a new dataframe that contains the rows 240, 480, 720, 960, and so on, for each symbol. In the original data frame every symbol has 2400 obs thus the new dataframe will have 10 obs by group. — El3030, Sep 05 '18 at 22:57

Rui Barradas · Answer 2 · 2018-09-06T02:47:50.827

Here is one way using base R.
Just like in the answer by user @Jilber Urbina, I will give an example using the built-in dataset iris.

fun <- function(DF, n = 240, start = n){
  DF[seq(start, NROW(DF), by = n), ]
}

res <- lapply(split(iris, iris$Species), fun, n = 24)
res <- do.call(rbind, res)  
row.names(res) <- NULL
res
#  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#1          5.1         3.3          1.7         0.5     setosa
#2          4.6         3.2          1.4         0.2     setosa
#3          6.1         2.8          4.7         1.2 versicolor
#4          6.2         2.9          4.3         1.3 versicolor
#5          6.3         2.7          4.9         1.8  virginica
#6          6.5         3.0          5.2         2.0  virginica

This can be wrapped in a function, which I named selectStepN.

#
# x - dataset to subset
# f - a factor, split criterion
# n - the step
# start - first row to select within each group (defaults to n)
#
selectStepN <- function(x, f, n = 240, start = n){
  fun <- function(DF, n){
    DF[seq(start, NROW(DF), by = n), ]
  }
  res <- lapply(split(x, f), fun, n = n)
  res <- do.call(rbind, res)  
  row.names(res) <- NULL
  res
}

selectStepN(iris, iris$Species, 24)
#  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#1          5.1         3.3          1.7         0.5     setosa
#2          4.6         3.2          1.4         0.2     setosa
#3          6.1         2.8          4.7         1.2 versicolor
#4          6.2         2.9          4.3         1.3 versicolor
#5          6.3         2.7          4.9         1.8  virginica
#6          6.5         3.0          5.2         2.0  virginica
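
The `start` argument controls where the selection begins inside each group. As a small sketch on iris, the call below starts at row 1 instead of row `n`; the function definition is repeated only so that the block runs on its own.

```r
# Same function as above, repeated so this block is self-contained
selectStepN <- function(x, f, n = 240, start = n){
  fun <- function(DF, n){
    DF[seq(start, NROW(DF), by = n), ]
  }
  res <- lapply(split(x, f), fun, n = n)
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}

# Default: rows 24 and 48 of each species
by_step <- selectStepN(iris, iris$Species, n = 24)

# start = 1: rows 1, 25 and 49 of each species
from_first <- selectStepN(iris, iris$Species, n = 24, start = 1)

nrow(by_step)      # 6
nrow(from_first)   # 9
```

With the data from the question, `selectStepN(df, df$Symbol)` would use the defaults `n = 240` and `start = 240`, assuming the data frame is called `df`.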

Hi, thanks for the answer. I didn't explain well. I want a new dataframe that contains the rows 240, 480, 720, 960, and so on, for each symbol. In the original data frame every symbol has 2400 obs thus the new dataframe will have 10 obs by group. — El3030, Sep 05 '18 at 22:58
@El3030 In the question the new df starts the groups at row 1, not 240. Code edited to start at 240 or any other value of the argument `n`. In the example I have set `n = 24`. — Rui Barradas, Sep 06 '18 at 02:48
