Subsetting of rows in R data frame

Question

I am currently scraping some data from the web and an example of the data looks like this:

    col_a | col_b | col_c | col_d
    1     | 2     | 44    | home1
    1     | 3     | 44    | home1
    1     | 7     | 44    | home1
    1     | 5     | 44    | home1
    1     | 2     | 44    | home1
    1     | 3     | 44    | home1
    1     | 7     | 44    | home1
    1     | 5     | 44    | home1
    2     | 8     | 42    | home1
    2     | 6     | 42    | home1
    2     | 4     | 42    | home1
    2     | 1     | 42    | home1

As seen in the example above, there are a total of 12 rows. The correct data is supposed to only have 8 rows of data, using "col_a" as reference, each unique "col_a" is supposed to have only 4 rows. So in this case row 5 to 8 are duplicates of row 1 to 4. That being said, the data scraped has 100,000 over rows and such duplicates happen all over the place. Is there a way to keep just the first 4 rows of each unique "col_a"? I cant think of an efficient way other than looping through each row.

Using `dplyr`, `df %>% group_by(col_a) %>% slice(1:4) ` – Ronak Shah Aug 14 '19 at 08:39 — Ronak Shah, Aug 14 '19 at 08:39
this works perfectly rather than looping thanks! – user7729135 Aug 14 '19 at 08:48 — user7729135, Aug 14 '19 at 08:48

score 1 · Answer 1 · answered Aug 14 '19 at 08:44

You can get the different values in col_a with:

col_a_unique <- unique(data$col_a)

Then eventually loop over it and keep only the first four of each:

for (a in col_a_unique)
{
 to_keep=data[which(data$col_a==a),][1:4]
}

wihch() will select only the lines fulfilling the equality, i.e. all the elements correcponding to a given value on col_a, and then with [1:4] you select the first four elements.

Then you need to create an appropriate object to store to_keep at each loop step.

This is a detailed way, there are more concise way of doing it.

Ronak Shah's slice method from dplyr works perfectly without looping. Thank you for the suggestion though! — user7729135, Aug 14 '19 at 08:50

score 0 · Accepted Answer · answered Aug 14 '19 at 08:50

There should be a dupe for this but since I can't find one I'll write up an answer.

Using dplyr, we could do

library(dplyr)
df %>% group_by(col_a) %>% slice(1:4)

#  col_a col_b col_c col_d
#  <int> <int> <int> <fct>
#1     1     2    44 home1
#2     1     3    44 home1
#3     1     7    44 home1
#4     1     5    44 home1
#5     2     8    42 home1
#6     2     6    42 home1
#7     2     4    42 home1
#8     2     1    42 home1

Using data.table

library(data.table)
setDT(df)[, .SD[1:4], by = col_a]

and base R.

df[with(df, ave(col_b, col_a, FUN = seq_along) <= 4), ]

score 0 · Answer 3 · answered Aug 14 '19 at 08:52

Base R with the help of data.table's rowid():

df[data.table::rowid(df$col_a) < 5, ]

#    col_a col_b col_c col_d
# 1      1     2    44 home1
# 2      1     3    44 home1
# 3      1     7    44 home1
# 4      1     5    44 home1
# 9      2     8    42 home1
# 10     2     6    42 home1
# 11     2     4    42 home1
# 12     2     1    42 home1

Subsetting of rows in R data frame

3 Answers3