I am trying to get some metrics on some data at my company.

Basically, I have a dataframe that I have titled rawData. rawData contains a number of columns, mostly parameters I am interested in. I don't think the specifics are too important, so we can just think of these as parameter1, parameter2, and so on.

There is an additional column, which I have titled overallResult. This column will always contain either the string PASS, or FAIL. I am trying to extract a sub-dataframe from my raw data based on the overallResult. It sounds simple enough, but I am messing up my implementation somehow.

I create my mask like this: mask = rawData[overallResult].eq(truthyVal), where in this case truthyVal is PASS.

The mask is created successfully, but applying it fails.

I apply the mask like this: filteredData = rawData[mask]. I would like filteredData to contain every column that rawData does, but only the rows where overallResult equals truthyVal.

Instead, it always gives me this error: cannot reindex on an axis with duplicate labels.

From what I understand, the mask contains a boolean list of my overallResult column, true if truthyVal is found on that row, and false if not. I am pretty sure that I am not applying my mask correctly here. There must be some small extra step I am overlooking, and at this point I am frustrated because it seems so simple, so IDK, any ideas?

creosean

1 Answer

You have the principle correct as the following basic example shows:

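import pandas as pd

# Small example frame: one data column and one pass/fail column
df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6],
                   'test': ['pass', 'fail', 'pass', 'fail', 'pass', 'fail']})

# Boolean Series: True where 'test' equals 'pass'
mask = df['test'].eq('pass')
# Keep only the rows where the mask is True
print(df[mask])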

To decipher your error message it would be necessary to see a data sample which produces it; you might get some useful insights here
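Without the real data, this is only a guess, but the wording of the error suggests that rawData has repeated row or column labels. The sketch below uses a made-up frame (the values and the duplicated index are assumptions, not your data) to show how to check for duplicates and how to apply the mask by position, so that pandas does not need to align labels:

```python
import pandas as pd

# Hypothetical stand-in for rawData; the repeated row labels are an assumption
raw_data = pd.DataFrame(
    {"parameter1": [1, 2, 3, 4],
     "overallResult": ["PASS", "FAIL", "PASS", "FAIL"]},
    index=[0, 0, 1, 1],
)

# Check whether the row labels and the column names are unique
rows_unique = raw_data.index.is_unique
cols_unique = raw_data.columns.is_unique
print("unique row labels:", rows_unique)
print("unique column names:", cols_unique)

# Convert the mask to a plain NumPy array so it is applied by position,
# with no label alignment between the mask and the frame
mask = raw_data["overallResult"].eq("PASS").to_numpy()
filtered = raw_data[mask]
print(filtered)
```

If the row labels turn out to be the duplicated ones, calling rawData.reset_index(drop=True) before building the mask is another option. If a column name is duplicated, rawData[...] returns a DataFrame rather than a single column, which would need to be sorted out first.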

user19077881
  • Yeah, it's probably something like that, thanks for the response. I wanted to make sure I was not missing anything obvious, which I thought was pretty likely. – creosean Jan 26 '23 at 01:25