Consider a population with a skewed class distribution, as in:
ErrorType Samples
1 XXXXXXXXXXXXXXX
2 XXXXXXXX
3 XX
4 XXX
5 XXXXXXXXXXXX
I would like to randomly sample 20 out of the 40 items without undersampling any of the classes with smaller participation. For example, in the above case I would want to sample as follows (| marks the per-class cutoff, * marks a slot the class cannot fill):
ErrorType Samples
1 XXXXX|XXXXXXXXXX
2 XXXXX|XXX
3 XX***|
4 XXX**|
5 XXXXX|XXXXXXX
i.e. 5 each of types 1, 2 and 5, 2 of type 3 and 3 of type 4.
- This guarantees a sample size as close as possible to my target, i.e. 20 samples.
- None of the classes is under-represented, especially classes 3 and 4.
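To make the target allocation concrete, here is a minimal sketch of how the per-class quotas could be computed from the class sizes alone. The function name `allocate_quotas` and the hard-coded counts are just for illustration (not a pandas or sklearn API), and giving any leftover units to the largest classes is my own assumption about how ties should be broken.

```python
import pandas as pd

def allocate_quotas(counts: pd.Series, sample_size: int) -> pd.Series:
    """Split sample_size across classes as evenly as possible, capped by class size."""
    quotas = pd.Series(0, index=counts.index, dtype=int)
    remaining = min(sample_size, int(counts.sum()))
    while remaining > 0:
        left = counts - quotas
        left = left[left > 0]  # classes that still have unsampled rows
        share = remaining // len(left)
        if share == 0:
            # Fewer units left than open classes: one extra for the largest ones
            grant = pd.Series(1, index=left.nlargest(remaining).index)
        else:
            # Equal share for every open class, capped by what it has left
            grant = left.clip(upper=share)
        quotas.loc[grant.index] += grant
        remaining -= int(grant.sum())
    return quotas

# Class sizes from the example above (ErrorType -> number of samples)
counts = pd.Series({1: 15, 2: 8, 3: 2, 4: 3, 5: 12})
quotas = allocate_quotas(counts, 20)
print(quotas.to_dict())  # quotas of 5, 5, 2, 3, 5 for types 1 to 5
```

Small classes are taken whole and their shortfall is redistributed over the classes that still have rows left, repeating until the target is reached.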
I ended up writing some rather roundabout code, but I believe there must be an easier way using pandas methods or some sklearn functions.
import pandas as pd  # df is a DataFrame with an ErrorType column

sample_size = 20  # Just for the example
# Determine the average participation per error type
value_counts = df.ErrorType.value_counts()
avg_items = sample_size / len(value_counts)
# Classes smaller than the average are sampled in full
less_than_avg = value_counts[value_counts < avg_items]
# Shortfall left by the small classes, spread evenly over the remaining classes
offset = avg_items * len(less_than_avg) - less_than_avg.sum()
offset_per_item = offset / (len(value_counts) - len(less_than_avg))
adj_avg = int(sample_size / len(value_counts) + offset_per_item)
# Sample at most adj_avg rows from each class
df = df.groupby(['ErrorType'],
                group_keys=False).apply(lambda g: g.sample(min(adj_avg, len(g))))
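
To reproduce the example, here is a small self-contained sketch. It builds a hypothetical DataFrame with the class sizes shown above and applies the capped per-group sampling with a cap of 5, the value the example calls for. The `value` column, the variable names and `random_state=0` are assumptions added only to make it runnable.

```python
import pandas as pd

# Hypothetical data with the class sizes from the example (40 rows in total)
sizes = {1: 15, 2: 8, 3: 2, 4: 3, 5: 12}
df = pd.DataFrame({"ErrorType": [t for t, n in sizes.items() for _ in range(n)]})
df["value"] = range(len(df))

cap = 5  # per-class cap that the example arrives at

# Sample at most `cap` rows from each class; small classes are taken whole
parts = [g.sample(min(cap, len(g)), random_state=0)
         for _, g in df.groupby("ErrorType")]
sampled = pd.concat(parts)

print(len(sampled))  # 20
print(sampled.ErrorType.value_counts().sort_index().to_dict())
```

Iterating over the groups and concatenating gives the same kind of result as the `groupby(...).apply(...)` call, just written out step by step.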