I have a dataframe of insect species, with the species given in the column 'class'. I would like to drop entries from the classes that exceed a certain threshold, in order to balance them against the classes that have only a few entries.
    df_counts = df['class'].value_counts()
    # True for the classes whose count exceeds the threshold
    class_balance = df_counts.where(df_counts > threshold).notnull()
    for idx, item in class_balance.items():  # iteritems() was removed in pandas 2.0
        if item:
            if df_counts[idx] > threshold:
                # number of surplus rows for this class
                n = int(df_counts[idx] - threshold)
                df_aux = df.drop(df[df['class'] == idx].sample(n=n).index)
                df_counts_b = df_aux['class'].value_counts()
So I iterate only over the classes that have exceeded this limit, df_counts.where(df_counts > threshold).notnull(), and I would like to update my dataframe by dropping the excess number of rows, n, at random, with sample(n=n).
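For reference, the intended behaviour can be written out as a minimal sketch. The toy data, the threshold value, the fixed random_state and the name df_balanced are all assumptions made for illustration, and the sketch also assumes that the index labels of the dataframe are unique.

```python
import pandas as pd

# Hypothetical toy data; the real dataframe and threshold are not shown here.
df = pd.DataFrame({"class": ["ant"] * 6 + ["bee"] * 3 + ["fly"] * 5})
threshold = 4

# Work on a copy and reassign it on every iteration, so that the rows
# dropped for one class stay dropped when the next class is processed.
df_balanced = df.copy()
counts = df_balanced["class"].value_counts()

for cls, count in counts[counts > threshold].items():
    excess = int(count - threshold)
    # Pick `excess` random rows of this class and drop them by index label.
    to_drop = (
        df_balanced[df_balanced["class"] == cls]
        .sample(n=excess, random_state=0)
        .index
    )
    df_balanced = df_balanced.drop(to_drop)

print(df_balanced["class"].value_counts())
```

Every class above the threshold should end up with exactly threshold rows, while the smaller classes are left untouched.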
But it does not seem to work this way, as recommended here. Note the difference between df_counts before the drop and after the first iteration:
It seems the index has been messed up: rows of other classes have been deleted as well. It should be simple to delete rows conditionally, but it behaves strangely. Any clue?
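One possible cause, which is only an assumption since the data itself is not shown, is duplicate index labels: drop() removes every row carrying a given label, not just the sampled one. The sketch below reproduces that effect on hypothetical data built with pd.concat; the names part_a, part_b, df_dup and df_unique are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two frames concatenated without ignore_index=True,
# so each of the index labels 0, 1, 2 appears twice.
part_a = pd.DataFrame({"class": ["ant", "ant", "ant"]})
part_b = pd.DataFrame({"class": ["bee", "bee", "bee"]})
df_dup = pd.concat([part_a, part_b])

# Dropping the label of a single "ant" row also removes the "bee" row
# that happens to share the same label.
label = df_dup[df_dup["class"] == "ant"].index[:1]
after_drop = df_dup.drop(label)
print(after_drop["class"].value_counts())

# After rebuilding a unique index, only the intended row is removed.
df_unique = df_dup.reset_index(drop=True)
label = df_unique[df_unique["class"] == "ant"].index[:1]
after_drop_unique = df_unique.drop(label)
print(after_drop_unique["class"].value_counts())
```

Checking df.index.is_unique on the real dataframe would show whether this applies.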