Problem

I am working on a machine learning project that aims to find out on what kind of raw text data the classifiers tend to make mistakes, and on what kind of data they reach no consensus.

I have a dataframe with the labels, the predictions of 2 classifiers, and the text data. Is there a simple way to select rows based on set operations over the prediction and label columns?

The data might look like this:

   score                                             review     svm_pred  dnn_pred
0      0  I went and saw this movie last night after bei...            0         1
1      1  Actor turned director Bill Paxton follows up h...            1         1
2      1  As a recreational golfer with some knowledge o...            0         1
3      1  I saw this film in a sneak preview, and it is ...            1         1
4      1  Bill Paxton has taken the true story of the 19...            1         1
5      1  I saw this film on September 1st, 2005 in Indi...            1         1
6      1  Maybe I'm reading into this too much, but I wo...            0         1
7      1  I felt this film did have many good qualities....            1         1
8      1  This movie is amazing because the fact that th...            1         1
9      0  "Quitting" may be as much about exiting a pre-...            1         1


For example, if I want to select the rows where both classifiers make mistakes, index 9 should be returned.

A made-up minimal working example (MWE) is provided here:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3), columns=["score", "svm_pred", "dnn_pred"])

which returns

   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
2      0         0         0
3      1         0         0
4      0         0         1
5      0         1         1
6      1         0         1
7      0         1         1
8      1         1         1
9      1         1         1

What I Have Done

I know I could list all possible combinations, 000, 001, etc. However,

  • This is not doable when I want to compare more classifiers.
  • This will not work for multi-class classification problems.

Could someone help me? Thank you in advance.

Why This Question is Not a Duplicate

The existing answers only consider the case where the number of columns is limited. However, in my application, the number of prediction columns (one per classifier) could be large, which makes the existing answers not quite applicable.

At the same time, this appears to be the first time the pd.Series.ne function is used for this particular application, which might shed some light for people with similar confusion.

Mr.Robot
  • Possible duplicate of [Select rows from a DataFrame based on values in a column in pandas](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas) – PV8 Jun 07 '19 at 09:05
  • Assuming `score` is always the first column, maybe try something like `df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(axis=1)`. That will return the number of incorrect classifiers. So simple boolean indexing where that value is equal to the number of classifiers would return rows where all classifiers got it wrong – Chris Adams Jun 07 '19 at 09:17

2 Answers

You can combine boolean conditions with & and | (the element-wise counterparts of set intersection and union) to select rows:

# returns the index of rows where both the svm and dnn predictions equal the score
df[(df['score'] == df['svm_pred']) & (df['score'] == df['dnn_pred'])].index

# returns the index of rows where both predictions are wrong
df[(df['score'] != df['svm_pred']) & (df['score'] != df['dnn_pred'])].index

# returns the index of rows where at least one prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])].index

If you are interested in the complete rows rather than just the index, omit the trailing .index:

# returns rows where at least one prediction is wrong
df[(df['score'] != df['svm_pred']) | (df['score'] != df['dnn_pred'])]
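
The pairwise conditions above become unwieldy as more classifiers are added. As a rough sketch of one way to generalise them, the comparison can be done for all prediction columns at once and reduced with all/any. The `pred_cols` list and the `_pred` suffix convention are assumptions made for illustration; the data is copied from the MWE table in the question.

```python
import pandas as pd

# Data copied from the MWE output in the question.
df = pd.DataFrame({
    "score":    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
    "svm_pred": [1, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    "dnn_pred": [0, 1, 0, 0, 1, 1, 1, 1, 1, 1],
})

# Assumption: every prediction column ends with "_pred".
pred_cols = [c for c in df.columns if c.endswith("_pred")]

# Boolean frame: True where a classifier disagrees with the true score.
wrong = df[pred_cols].ne(df["score"], axis=0)

all_wrong = df[wrong.all(axis=1)]    # every classifier is wrong
any_wrong = df[wrong.any(axis=1)]    # at least one classifier is wrong
all_right = df[~wrong.any(axis=1)]   # every classifier is right
```

Because the reduction runs over whatever columns are in `pred_cols`, adding a third or tenth classifier should not require new conditions.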
ABot

Create a helper Series holding the number of incorrect classifiers per row, which you can then use in logical operations. This assumes that the true score is in the first column and the predictions are in the second column onwards; you may need to update the slicing indices accordingly.

s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)

Example Usage:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 2, 30).reshape(10, 3),
                  columns=["score", "svm_pred", "dnn_pred"])

s = df.iloc[:, 1:].ne(df.iloc[:, 0], axis=0).sum(1)

# Return rows where all classifiers got it right
df[s.eq(0)]

   score  svm_pred  dnn_pred
2      0         0         0
8      1         1         1
9      1         1         1

# Return rows where exactly 1 classifier got it wrong
df[s.eq(1)]

   score  svm_pred  dnn_pred
0      0         1         0
1      0         0         1
4      0         0         1
6      1         0         1

# Return rows where all classifiers got it wrong (2 of 2 here)
df[s.eq(2)]

   score  svm_pred  dnn_pred
3      1         0         0
5      0         1         1
7      0         1         1
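
The question also asks about more classifiers, multi-class labels, and rows where the classifiers reach no consensus. The following is a minimal sketch of how the same error-count idea might extend to that case; the three classifier columns (`clf_a`, `clf_b`, `clf_c`), the `label` column, and the data are all made up for illustration. Disagreement is detected here with `nunique(axis=1)`, which is one possible choice and not part of the answer above.

```python
import pandas as pd

# Hypothetical 3-class problem with three classifiers (made-up data).
df = pd.DataFrame({
    "label": [0, 1, 2, 2, 1, 0],
    "clf_a": [0, 1, 1, 2, 0, 2],
    "clf_b": [0, 2, 1, 2, 0, 1],
    "clf_c": [0, 1, 1, 0, 0, 0],
})
pred_cols = ["clf_a", "clf_b", "clf_c"]

# Number of incorrect classifiers per row.
s = df[pred_cols].ne(df["label"], axis=0).sum(axis=1)

# Rows where every classifier is wrong, whatever the number of classifiers.
all_wrong = df[s.eq(len(pred_cols))]

# Rows where the classifiers do not all predict the same class.
no_consensus = df[df[pred_cols].nunique(axis=1).gt(1)]
```

Comparing against `len(pred_cols)` instead of a literal 2 keeps the "all wrong" selection valid when classifiers are added or removed.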
Chris Adams