How to create a smaller dataframe from an existing dataframe with the same numbers per label

Question

I have a dataframe with 50k rows and two columns, item and label. I want to reduce the number of rows while keeping the same number of rows for every label, so that it looks like:

Label "notebook": 1000 rows
Label "ballpoint": 1000 rows
Label "pencil": 1000 rows
Label "eraser": 1000 rows
Label "pencil sharpener": 1000 rows

So the 50k rows would be reduced to only 5000 rows, with the same number of rows for each label.

`.groupby('label').head(1000)`? Is that what you're looking for? If not, make a [mre]. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). Like, for the sake of example, you could probably do with as little as two labels with 5 rows each down to 2 rows each. — wjandrea, Jun 28 '23 at 22:05

Himanshu Panwar · Accepted Answer · 2023-06-28T22:36:56.850

You need to perform stratified sampling, which simply means splitting your data into groups and then sampling from each group.

The sampling could be proportionate (the same fraction of every group) or disproportionate (a fixed number of rows per group, regardless of group size). Since you have already mentioned that you want 1000 rows for each label, go for disproportionate sampling. The sample code is below:

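import pandas as pd

data = {
    "item": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "label": ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A', 'B'],
}
df = pd.DataFrame(data)

# Sample two rows for each label; sample() returns a new frame and leaves df unchanged
sampled = df.groupby("label").sample(n=2)
print(sampled)
# Rows are chosen at random, so the output varies between runs. One possible result:

   item label
2     3     A
6     7     A
5     6     B
4     5     B
3     4     C
7     8     C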
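
Scaled up to the question's case, a minimal sketch might look like the following. The frame here is synthetic stand-in data (the real 50k-row frame is not shown in the question), `n_per_label` is a hypothetical name, and `random_state=0` is an arbitrary seed added only so that reruns pick the same rows. It contrasts disproportionate sampling (`n=`) with proportionate sampling (`frac=`).

```python
import pandas as pd

# Synthetic stand-in for the real data: 5 labels with unequal row counts.
labels = ["notebook", "ballpoint", "pencil", "eraser", "pencil sharpener"]
counts = [3000, 2500, 2000, 1500, 1000]
df = pd.DataFrame({
    "item": range(sum(counts)),
    "label": [lab for lab, c in zip(labels, counts) for _ in range(c)],
})

n_per_label = 1000  # hypothetical name: rows to keep per label

# Disproportionate: the same fixed number of rows from every label.
balanced = df.groupby("label").sample(n=n_per_label, random_state=0)

# Proportionate: the same fraction of every label, so label ratios are preserved.
proportional = df.groupby("label").sample(frac=0.1, random_state=0)

# Check the resulting rows per label.
print(balanced["label"].value_counts())
print(proportional["label"].value_counts())
```

Note that `groupby(...).sample` needs pandas 1.1 or later, and with the default `replace=False` it raises a `ValueError` if any label has fewer than `n` rows, so it is worth checking `df["label"].value_counts()` first.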

edited Jun 28 '23 at 22:36

answered Jun 28 '23 at 22:06

Himanshu Panwar

I think you can just use [`.sample`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html) instead of `.apply(.sample)`. – wjandrea Jun 28 '23 at 22:17
No, it will give an error. Here sample() is extracting 2 rows from each group, i.e. x – Himanshu Panwar Jun 28 '23 at 22:32
What's the error? It worked fine for me. – wjandrea Jun 28 '23 at 22:33
My bad, I wrote .apply(sample). Yes, it's working fine, and it's more elegant :) – Himanshu Panwar Jun 28 '23 at 22:35
