I am currently scraping data from the web, and a sample of the data looks like this:
col_a | col_b | col_c | col_d
1 | 2 | 44 | home1
1 | 3 | 44 | home1
1 | 7 | 44 | home1
1 | 5 | 44 | home1
1 | 2 | 44 | home1
1 | 3 | 44 | home1
1 | 7 | 44 | home1
1 | 5 | 44 | home1
2 | 8 | 42 | home1
2 | 6 | 42 | home1
2 | 4 | 42 | home1
2 | 1 | 42 | home1
As seen in the example above, there are 12 rows in total. The correct data should have only 8 rows: using "col_a" as the reference, each unique "col_a" value should have only 4 rows, so in this case rows 5 to 8 are duplicates of rows 1 to 4. The actual scraped data has over 100,000 rows, and such duplicates occur throughout. Is there a way to keep just the first 4 rows for each unique "col_a"? I can't think of an efficient way other than looping through each row.
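For concreteness, here is a minimal pandas sketch of the behavior I'm after (assuming the scraped rows are loaded into a DataFrame; the sample values below just mirror the example table):

```python
import pandas as pd

# Sample data mirroring the example above
df = pd.DataFrame({
    "col_a": [1] * 8 + [2] * 4,
    "col_b": [2, 3, 7, 5, 2, 3, 7, 5, 8, 6, 4, 1],
    "col_c": [44] * 8 + [42] * 4,
    "col_d": ["home1"] * 12,
})

# Keep only the first 4 rows of each unique col_a,
# vectorized rather than looping over rows
deduped = df.groupby("col_a", sort=False).head(4)
print(len(deduped))  # 8
```

Would something like `groupby(...).head(4)` be the right approach here, or is there a faster way for data this size?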