pandas: filter rows by two column values, case-insensitive

Question

I have a simple dataframe as follows:

Last Known Date ConfigredValue  ReferenceValue
0   24-Jun-17   False   FALSE
1   25-Jun-17   FALSE   FALSE
2   26-Jun-17   TRUE    FALSE
3   27-Jun-17   FALSE   FALSE
4   28-Jun-17   false   FALSE

If I do the following command

df=df[df['ConfigredValue']!=df['ReferenceValue']]

then I get as below

0   24-Jun-17   False   FALSE
2   26-Jun-17   TRUE    FALSE
4   28-Jun-17   false   FALSE

But I want the comparison to be case-insensitive (like case=False).

I want the following output:

2   26-Jun-17   TRUE    FALSE

Please suggest how to filter the data case-insensitively (case=False).

Mohamed Ali JAMAOUI · Accepted Answer · 2020-02-10T08:40:37.543

Option 1: convert to lowercase or to uppercase and compare

The simplest is to convert the two columns to lower (or to upper) before checking for equality:

df=df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]

or

df=df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2    2  26-Jun-17           TRUE          FALSE
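For readers who want to reproduce Option 1 end to end, here is a minimal sketch that rebuilds the question's frame by hand. Using `str.casefold()` instead of `str.lower()` is an assumption on my part, not part of the original answer; it is the more aggressive normalization for non-ASCII text and behaves the same as `lower()` for these values.

```python
import pandas as pd

# Rebuild the sample frame from the question; every value is a string.
df = pd.DataFrame({
    "Last Known Date": ["24-Jun-17", "25-Jun-17", "26-Jun-17", "27-Jun-17", "28-Jun-17"],
    "ConfigredValue": ["False", "FALSE", "TRUE", "FALSE", "false"],
    "ReferenceValue": ["FALSE"] * 5,
})

# Normalize case on both sides, then keep only the rows that still differ.
mask = df["ConfigredValue"].str.casefold() != df["ReferenceValue"].str.casefold()
result = df[mask]
print(result)
```

Building the mask separately leaves `df` untouched, so the original spelling of each value is still available after filtering.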

Option 2: Compare the lengths

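In this particular case, you can simply compare string lengths. TRUE has four characters and FALSE has five, and the length is the same whether the string is upper or lower case, so two values differ exactly when their lengths differ. This only works because the columns contain nothing but true/false strings: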

df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2    2  26-Jun-17           TRUE          FALSE
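The length trick depends on the two possible values having different lengths. A small sketch with hypothetical data (the `yes`/`off` row is invented for illustration) shows where it stops agreeing with the case-normalized comparison:

```python
import pandas as pd

# Hypothetical data: "yes" and "off" differ but have the same length.
demo = pd.DataFrame({
    "a": ["TRUE", "false", "yes"],
    "b": ["true", "FALSE", "off"],
})

# Length comparison: misses row 2 because both strings have 3 characters.
by_len = demo[demo["a"].str.len() != demo["b"].str.len()]

# Case-normalized comparison: reports row 2 as a real difference.
by_case = demo[demo["a"].str.lower() != demo["b"].str.lower()]

print(by_len.index.tolist())
print(by_case.index.tolist())
```

So Option 2 is best treated as a shortcut for this specific data rather than a general case-insensitive comparison.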

Option 3: Vectorized title

str.title() was also suggested in @0p3n5ourcE's answer; here is a vectorized version of it:

df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]

Execution time

Benchmarking shows that str.len() is slightly faster (479 µs versus about 496 µs on this small frame):

In [35]: timeit df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]
1000 loops, best of 3: 496 µs per loop

In [36]: timeit df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]
1000 loops, best of 3: 496 µs per loop

In [37]: timeit df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]
1000 loops, best of 3: 495 µs per loop

In [38]: timeit df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]
1000 loops, best of 3: 479 µs per loop
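The timings above come from a five-row frame, where fixed overhead dominates. A rough way to repeat the comparison outside IPython is sketched below; the frame size and the `candidates` dictionary are assumptions made for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical larger frame: repeat the question's five values 2,000 times.
n = 2_000
big = pd.DataFrame({
    "ConfigredValue": ["False", "FALSE", "TRUE", "FALSE", "false"] * n,
    "ReferenceValue": ["FALSE"] * (5 * n),
})

# Two of the options from this answer, wrapped as zero-argument callables.
candidates = {
    "lower": lambda: big[big["ConfigredValue"].str.lower() != big["ReferenceValue"].str.lower()],
    "len": lambda: big[big["ConfigredValue"].str.len() != big["ReferenceValue"].str.len()],
}

# Best of 3 repeats, each repeat running the statement 3 times.
timings = {name: min(timeit.repeat(fn, number=3, repeat=3)) for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 3 runs")
```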

score 3 · Answer 2 · answered Sep 25 '17 at 16:51

It may be better to normalize the column first: replace every spelling of false with 'FALSE' using the case=False parameter, i.e.

df['ConfigredValue'] = df['ConfigredValue'].str.replace('false','FALSE',case=False)

df=df[df['ConfigredValue']!=df['ReferenceValue']]

Output:

   Last Known_Date ConfigredValue ReferenceValue
2     2  26-Jun-17           TRUE          FALSE
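To see what the replacement step does on its own, here is a small sketch on a standalone Series. Passing `regex=True` explicitly is my assumption rather than part of the answer: the default for `regex` has changed between pandas versions, and the pattern here is a plain word either way.

```python
import pandas as pd

values = pd.Series(["False", "FALSE", "TRUE", "FALSE", "false"])

# case=False makes the match ignore case, so every spelling of "false"
# is rewritten to the single canonical form "FALSE".
normalized = values.str.replace("false", "FALSE", case=False, regex=True)
print(normalized.tolist())
```

Only the spellings of false are normalized this way. A value such as `True` would still compare unequal to `TRUE`, so both words need the same treatment if the reference column can contain either.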

niraj · Answer 3 · 2017-09-25T17:19:19.350

The columns appear to hold boolean values. If converting them to a boolean dtype is acceptable, the following also works. Here .title() capitalizes only the first character (FALSE becomes False, true becomes True), which gives a string that eval can turn into the corresponding boolean. Note that eval executes arbitrary code, so this is only reasonable when the column contents are trusted:

df['ConfigredValue'] = df['ConfigredValue'].apply(lambda row: eval(row.title()))
df['ReferenceValue'] = df['ReferenceValue'].apply(lambda row: eval(row.title()))

Then use the same comparison as before:

df[df['ConfigredValue'] != df['ReferenceValue']]

Output:

    Last Known Date  ConfigredValue  ReferenceValue
2       26-Jun-17            True           False

Or simply compare the title-cased strings, just like the uppercase or lowercase variants:

df[df['ConfigredValue'].str.title() !=df['ReferenceValue'].str.title()]
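The boolean conversion can also be sketched without eval. The version below swaps in a dictionary lookup, which is a different technique from the one in this answer; the `to_bool` mapping is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Last Known Date": ["24-Jun-17", "25-Jun-17", "26-Jun-17", "27-Jun-17", "28-Jun-17"],
    "ConfigredValue": ["False", "FALSE", "TRUE", "FALSE", "false"],
    "ReferenceValue": ["FALSE"] * 5,
})

# Map the lowercased text to real booleans. Any string that is not a key
# of the dict becomes NaN, which makes unexpected values easy to spot.
to_bool = {"true": True, "false": False}
for col in ["ConfigredValue", "ReferenceValue"]:
    df[col] = df[col].str.lower().map(to_bool)

result = df[df["ConfigredValue"] != df["ReferenceValue"]]
print(result)
```

Unlike eval, the lookup never executes the cell contents, so it stays safe when the data comes from an untrusted source.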

edited Sep 25 '17 at 17:19

answered Sep 25 '17 at 17:10

I'm glad someone used `str.title` – piRSquared Sep 25 '17 at 17:16
@piRSquared Thanks, but I am not sure if changing the datatype to boolean would be better practice. – niraj Sep 25 '17 at 17:20
Agreed. Ultimately, it's up to OP. You provided an option that you can follow up with `literal_eval` from the `ast` package to change to `bool`. – piRSquared Sep 25 '17 at 17:26

score 1 · Answer 4 · answered Sep 25 '17 at 17:16

Outside the box
pandas.read_csv reads all of these in as boolean. You can dump to csv and read it in again. Then you can use pd.DataFrame.query

import io
pd.read_csv(io.StringIO(df.to_csv(index=False))).query(
    'ConfigredValue != ReferenceValue')

  Last Known Date  ConfigredValue  ReferenceValue
2       26-Jun-17            True           False
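A self-contained sketch of the same round trip is shown below, assuming the frame from the question and the standard library's `io.StringIO` as the in-memory buffer. Printing the dtypes makes it visible that the parser has re-inferred both value columns as booleans.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Last Known Date": ["24-Jun-17", "25-Jun-17", "26-Jun-17", "27-Jun-17", "28-Jun-17"],
    "ConfigredValue": ["False", "FALSE", "TRUE", "FALSE", "false"],
    "ReferenceValue": ["FALSE"] * 5,
})

# Write the frame to CSV text in memory, then let read_csv re-infer dtypes.
csv_text = df.to_csv(index=False)
reparsed = pd.read_csv(io.StringIO(csv_text))
print(reparsed.dtypes)

# With real booleans in both columns, a plain inequality is enough.
result = reparsed.query("ConfigredValue != ReferenceValue")
print(result)
```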
