I have a dataframe with 100,000 rows and columns such as Country, State, bill_ID, item_id, dates, etc. I want to randomly sample 5,000 rows out of the 100,000, and the sample must contain at least one bill_ID from every country and state. In short, it should cover all countries and states with at least one bill_ID each.

Note: each bill_ID contains multiple item_id values.

I am testing on sampled data, which should cover all unique countries and states with their bill_IDs.

James Z
Alpha Beta
  • 1
    You should add a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) (also look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) that replicates your problem. – Timus Jun 13 '23 at 08:23
  • 1
    Please use universal measurements instead of local words like *lakh* that rest of the world does not use or understand. – James Z Jun 13 '23 at 08:52

1 Answer

You could use pandas' .sample method. With df as your dataframe, try:

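import pandas as pd

sample_size = 5_000
# One random row per (Country, State) group guarantees coverage.
df_sample_1 = df.groupby(["Country", "State"]).sample(1)
# Fill up to sample_size (never negative) from the rows not drawn yet.
sample_size_2 = max(sample_size - df_sample_1.shape[0], 0)
df_sample_2 = df.loc[df.index.difference(df_sample_1.index)].sample(sample_size_2)
df_sample = pd.concat([df_sample_1, df_sample_2]).sort_index()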

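First, group by the columns Country and State and draw one row from each group. This gives you a sample df_sample_1 that covers each Country-State combination exactly once. Then draw the remaining rows, df_sample_2, from the part of the dataframe that doesn't contain the first sample. Finally, concatenate both samples (and sort the result if needed). Note that this relies on df having a unique index, and that if there are more Country-State combinations than sample_size, the result is larger than sample_size.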
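To try the steps above end to end, here is a minimal sketch on synthetic data. The country and state labels, the row counts, and the helper name sample_covering are made up for illustration and are not from the question. It assumes df has a unique index and a pandas version that provides groupby(...).sample (1.1 or later).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (all values here are invented).
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "Country": rng.choice(["US", "IN", "DE", "BR"], size=n_rows),
    "State": rng.choice(["S1", "S2", "S3", "S4", "S5"], size=n_rows),
    "bill_ID": rng.integers(0, 20_000, size=n_rows),
    "item_id": np.arange(n_rows),
})


def sample_covering(df, sample_size, keys, random_state=None):
    """Hypothetical helper: random sample with at least one row per key group.

    Assumes df.index is unique, because the rows already drawn are
    excluded by index label.
    """
    # One random row per group guarantees coverage of every combination.
    first = df.groupby(keys).sample(1, random_state=random_state)
    # Number of rows still needed; zero if the groups alone reach the target.
    remaining = max(sample_size - len(first), 0)
    # Draw the rest from the rows that were not picked in the first step.
    rest = df.loc[df.index.difference(first.index)].sample(
        remaining, random_state=random_state
    )
    return pd.concat([first, rest]).sort_index()


keys = ["Country", "State"]
df_sample = sample_covering(df, 5_000, keys, random_state=0)

# Coverage check: the sample has as many key combinations as the full data.
print(len(df_sample), df_sample.groupby(keys).ngroups == df.groupby(keys).ngroups)
```

This samples rows, not whole bills, so a sampled bill_ID may appear with only some of its item_id rows. If complete bills are needed, one option would be to sample bill_ID values and then select all of their rows, but that changes the size guarantees and is not shown here.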

Timus