
Given a dataframe of insect species, with the species specified in a column 'class', I would like to drop entries from the classes that exceed a certain threshold, in order to balance them against the classes that do not have many entries.

    df_counts = df['class'].value_counts()
    class_balance = df_counts.where(df_counts > threshold).notnull()

    for idx, item in class_balance.items():  # items(); iteritems() is deprecated
        if item:
            if df_counts[idx] > threshold:
                # number of surplus rows for this class
                n = int(df_counts[idx] - threshold)

                # randomly drop n rows of that class
                df_aux = df.drop(df[df['class'] == idx].sample(n=n).index)
                df_counts_b = df_aux['class'].value_counts()

So I iterate only over the classes that have exceeded the limit, df_counts.where(df_counts > threshold).notnull(), and I would like to update my dataframe by dropping the surplus number of rows, n, at random: sample(n=n).
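To illustrate, df_counts.where(df_counts > threshold).notnull() gives a boolean Series indexed by class name; a toy example with made-up counts and an assumed threshold of 100:

    import pandas as pd

    # made-up counts, purely for illustration
    df_counts = pd.Series({'ant': 181, 'bee': 40, 'wasp': 250})
    threshold = 100
    class_balance = df_counts.where(df_counts > threshold).notnull()
    print(class_balance)
    # ant      True
    # bee     False
    # wasp     True
    # dtype: bool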

But it does not seem to work this way, as recommended here. Note the difference between df_counts before the drop and after the first iteration:

[Screenshots: df_counts before dropping entries, and after the first iteration]

It seems the index has been messed up: rows from other classes have been deleted. It should be simple to drop rows conditionally, but it behaves strangely. Any clue?
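For reference, the balancing I am after could also be written in one step, downsampling every class to at most the threshold (just a sketch, assuming the same df and threshold as above):

    import pandas as pd

    # Keep at most `threshold` rows per class, chosen at random;
    # classes already below the threshold are left untouched.
    df_balanced = (
        df.groupby('class', group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), threshold)))
    )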

1 Answer


Well, I ended up with a simple explanation for that behavior: it had nothing to do with the drop itself, but with a badly structured index. The dataframe was built from multiple CSVs, which were loaded like this:

    import pandas as pd

    frames = []
    for csv in csv_list:
        df = pd.read_csv(csv)   # each CSV gets its own default 0..n-1 index
        frames.append(df)
    df_main = pd.concat(frames)

The problem, then, was the missing ignore_index in the concat() call:

    df_main = pd.concat(frames, ignore_index=True)

Then it worked fine!
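For anyone hitting the same thing, here is a minimal sketch (with made-up frames) of why the duplicate labels broke the drop:

    import pandas as pd

    a = pd.DataFrame({'class': ['ant', 'bee']})    # index 0, 1
    b = pd.DataFrame({'class': ['wasp', 'moth']})  # index 0, 1 again
    df_main = pd.concat([a, b])                    # labels 0 and 1 are duplicated

    # Dropping label 0 removes BOTH the 'ant' and the 'wasp' row:
    print(df_main.drop(0))

    # With ignore_index=True the labels are unique, so the drop is precise:
    df_main = pd.concat([a, b], ignore_index=True)
    print(df_main.drop(0))  # removes only the 'ant' row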
