1

I am currently scraping some data from the web and an example of the data looks like this:

    col_a | col_b | col_c | col_d
    1     | 2     | 44    | home1
    1     | 3     | 44    | home1
    1     | 7     | 44    | home1
    1     | 5     | 44    | home1
    1     | 2     | 44    | home1
    1     | 3     | 44    | home1
    1     | 7     | 44    | home1
    1     | 5     | 44    | home1
    2     | 8     | 42    | home1
    2     | 6     | 42    | home1
    2     | 4     | 42    | home1
    2     | 1     | 42    | home1

As seen in the example above, there are a total of 12 rows. The correct data is supposed to only have 8 rows of data, using "col_a" as reference, each unique "col_a" is supposed to have only 4 rows. So in this case row 5 to 8 are duplicates of row 1 to 4. That being said, the data scraped has 100,000 over rows and such duplicates happen all over the place. Is there a way to keep just the first 4 rows of each unique "col_a"? I cant think of an efficient way other than looping through each row.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
user7729135
  • 399
  • 1
  • 3
  • 11

3 Answers3

1

You can get the different values in col_a with:

col_a_unique <- unique(data$col_a)

Then eventually loop over it and keep only the first four of each:

for (a in col_a_unique)
{
 to_keep=data[which(data$col_a==a),][1:4]
}

wihch() will select only the lines fulfilling the equality, i.e. all the elements correcponding to a given value on col_a, and then with [1:4] you select the first four elements.

Then you need to create an appropriate object to store to_keep at each loop step.

This is a detailed way, there are more concise way of doing it.

Chelmy88
  • 1,106
  • 1
  • 6
  • 17
0

There should be a dupe for this but since I can't find one I'll write up an answer.

Using dplyr, we could do

library(dplyr)
df %>% group_by(col_a) %>% slice(1:4)

#  col_a col_b col_c col_d
#  <int> <int> <int> <fct>
#1     1     2    44 home1
#2     1     3    44 home1
#3     1     7    44 home1
#4     1     5    44 home1
#5     2     8    42 home1
#6     2     6    42 home1
#7     2     4    42 home1
#8     2     1    42 home1

Using data.table

library(data.table)
setDT(df)[, .SD[1:4], by = col_a]

and base R.

df[with(df, ave(col_b, col_a, FUN = seq_along) <= 4), ]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

Base R with the help of data.table's rowid():

df[data.table::rowid(df$col_a) < 5, ]

#    col_a col_b col_c col_d
# 1      1     2    44 home1
# 2      1     3    44 home1
# 3      1     7    44 home1
# 4      1     5    44 home1
# 9      2     8    42 home1
# 10     2     6    42 home1
# 11     2     4    42 home1
# 12     2     1    42 home1
s_baldur
  • 29,441
  • 4
  • 36
  • 69