Random sample of data based on other columns using Python

Question

I have a dataframe with 100,000 rows that contains the columns Country, State, bill_ID, item_id, dates, etc. I want to randomly sample 5k rows out of the 100k such that the sample contains at least one bill_ID from every country and state. In short, it should cover all countries and states with at least one bill_ID.

Note: a bill_ID can contain multiple item_id values.

I am testing on sampled data, which should cover all unique countries and states with their bill_IDs.

You should add a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) (also look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) that replicates your problem. — Timus, Jun 13 '23 at 08:23
Please use universal measurements instead of local words like *lakh* that rest of the world does not use or understand. — James Z, Jun 13 '23 at 08:52

score 1 · Accepted Answer · answered Jun 13 '23 at 09:40

You could use pandas' .sample method. With df as your dataframe, try:

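import pandas as pd

sample_size = 5_000
# One random row per Country-State combination, so every combination is covered
df_sample_1 = df.groupby(["Country", "State"]).sample(1)
# Number of rows still needed to reach sample_size (never negative)
sample_size_2 = max(sample_size - df_sample_1.shape[0], 0)
# Draw the remainder from the rows that were not already sampled
df_sample_2 = df.loc[df.index.difference(df_sample_1.index)].sample(sample_size_2)
df_sample = pd.concat([df_sample_1, df_sample_2]).sort_index()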

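First group by the columns Country and State and draw a sample of size 1 from each group. This gives you a sample df_sample_1 that covers each Country-State combination exactly once. Then draw the rest, df_sample_2, from the rows of the dataframe that are not in the first sample. Finally, concatenate both samples (and sort the result if needed). Note that if there are more Country-State combinations than sample_size, the result will contain one row per combination and therefore be larger than sample_size.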
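To check that this really covers every combination, here is a minimal self-contained sketch of the same two-step idea on made-up data. The function name sample_covering_groups, the seed parameter and the toy dataframe are my own illustrations, not anything from the question. It assumes the dataframe has a unique index, and it additionally caps the second draw at the number of remaining rows.

```python
import numpy as np
import pandas as pd


def sample_covering_groups(df, group_cols, sample_size, seed=None):
    """Draw about `sample_size` rows with at least one row per group.

    Hypothetical helper; assumes `df` has a unique index.
    """
    # Step 1: one random row per group guarantees coverage.
    covering = df.groupby(group_cols).sample(n=1, random_state=seed)
    # Step 2: fill up from the rows not drawn yet, capped at what is left.
    rest = df.drop(index=covering.index)
    n_extra = min(max(sample_size - len(covering), 0), len(rest))
    extra = rest.sample(n=n_extra, random_state=seed)
    return pd.concat([covering, extra]).sort_index()


# Made-up toy data: 1,000 rows, 3 countries x 4 states.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Country": rng.choice(["A", "B", "C"], size=1000),
    "State": rng.choice(["s1", "s2", "s3", "s4"], size=1000),
    "bill_ID": rng.integers(1, 200, size=1000),
})

sample = sample_covering_groups(toy, ["Country", "State"], sample_size=50, seed=1)

# Compare the combinations present in the sample with those in the full data.
all_pairs = set(zip(toy["Country"], toy["State"]))
sampled_pairs = set(zip(sample["Country"], sample["State"]))
print(len(sample), sampled_pairs == all_pairs)
```

Keep in mind that this samples rows, not whole bills, so a bill_ID with several item_id rows may end up only partially in the sample.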
