
I have a dataframe with 50k rows and two columns, item and labels. I want to reduce the number of rows so that every label ends up with the same number of rows. The result should look like:

  • Label "notebook": 1000 rows
  • Label "ballpoint": 1000 rows
  • Label "pencil": 1000 rows
  • Label "eraser": 1000 rows
  • Label "pencil sharpener": 1000 rows

So the 50k rows should be reduced to only 5000 rows, with the same number of rows for each label.

  • `.groupby('label').head(1000)`? Is that what you're looking for? If not, please provide a minimal reproducible example. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). For the sake of example, two labels with 5 rows each, reduced to 2 rows each, would probably be enough. – wjandrea Jun 28 '23 at 22:05

1 Answer


You need to perform stratified sampling, which simply means splitting your data into groups and then sampling from each group.

The sampling can be proportionate (each group contributes rows in proportion to its size) or disproportionate (each group contributes a fixed number of rows). Since you want 1000 rows for each label, go for disproportionate sampling. The sample code is below:

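import pandas as pd

data = {
    "item": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "label": ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A', 'B'],
}
df = pd.DataFrame(data)

# Sample two rows for each label. sample() returns a new DataFrame and
# leaves df unchanged, so the result has to be assigned before printing.
sampled = df.groupby("label").sample(n=2).reset_index(drop=True)
print(sampled)
# The rows are drawn at random, so the output varies between runs. One possible result: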
   item label
0   3   A
1   7   A
2   6   B
3   5   B
4   4   C
5   8   C
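
To make the difference between the two kinds of sampling concrete, here is a minimal sketch. The toy data (three labels of unequal size), the sample sizes and the variable names `fixed` and `proportional` are made up for illustration and are not taken from your dataframe. `n=` draws a fixed number of rows per label, `frac=` draws the same fraction of every label, and `random_state` makes the draw repeatable.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 50k-row frame:
# three labels with 30, 20 and 10 rows respectively.
df = pd.DataFrame({
    "item": range(60),
    "label": ["notebook"] * 30 + ["pencil"] * 20 + ["eraser"] * 10,
})

# Disproportionate: the same fixed number of rows from every label.
# random_state makes the random draw repeatable across runs.
fixed = df.groupby("label").sample(n=5, random_state=0)

# Proportionate: the same fraction of every label, so the labels
# keep their relative sizes (here 3, 2 and 1 rows).
proportional = df.groupby("label").sample(frac=0.1, random_state=0)

print(fixed["label"].value_counts())
print(proportional["label"].value_counts())
```

For your case that would be `n=1000`. Note that if any label has fewer rows than `n`, `sample` raises a `ValueError` unless you pass `replace=True`, which allows the same row to be drawn more than once.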
Himanshu Panwar