5

I have a simple dataframe as follows:

Last Known Date ConfigredValue  ReferenceValue
0   24-Jun-17   False   FALSE
1   25-Jun-17   FALSE   FALSE
2   26-Jun-17   TRUE    FALSE
3   27-Jun-17   FALSE   FALSE
4   28-Jun-17   false   FALSE

If I do the following command

df=df[df['ConfigredValue']!=dfs['ReferenceValue']]

then I get as below

0   24-Jun-17   False   FALSE
2   26-Jun-17   TRUE    FALSE
4   28-Jun-17   false   FALSE

But I want the filter with case insensitive (case=False)

I want following output:

2   26-Jun-17   TRUE    FALSE

Please suggest, how to get filtered case insensitive data(case=false)

frogatto
  • 28,539
  • 11
  • 83
  • 129
Python Spark
  • 303
  • 2
  • 6
  • 16

4 Answers4

8

Option 1: convert to lowercase or to uppercase and compare

The simplest is to convert the two columns to lower (or to upper) before checking for equality:

df=df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]

or

df=df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2    2  26-Jun-17           TRUE          FALSE

Option 2: Compare the lengths

In this particuler case, you can simply compare the lengths of TRUE and True, they are the same wether the string is upper or lower case:

df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]

output:

Out: 
  Last Known Date ConfigredValue ReferenceValue
2    2  26-Jun-17           TRUE          FALSE

Option 3: Vectorized title

str.title() was also suggested in @0p3n5ourcE answer, here's a vectorized version of it:

df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]

Execution time

Benchmarking the speed shows that str.len() is a bit faster

In [35]: timeit df[df['ConfigredValue'].str.lower()!=df['ReferenceValue'].str.lower()]
1000 loops, best of 3: 496 µs per loop

In [36]: timeit df[df['ConfigredValue'].str.upper()!=df['ReferenceValue'].str.upper()]
1000 loops, best of 3: 496 µs per loop

In [37]: timeit df[df['ConfigredValue'].str.title()!=df['ReferenceValue'].str.title()]
1000 loops, best of 3: 495 µs per loop

In [38]: timeit df[df['ConfigredValue'].str.len()!=df['ReferenceValue'].str.len()]
1000 loops, best of 3: 479 µs per loop
Mohamed Ali JAMAOUI
  • 14,275
  • 14
  • 73
  • 117
3

Better replace existing false with 'FALSE' with case = False parameter ie

df['ConfigredValue'] = df['ConfigredValue'].str.replace('false','FALSE',case=False)

df=df[df['ConfigredValue']!=df['ReferenceValue']]

Output:

   Last Known_Date ConfigredValue ReferenceValue
2     2  26-Jun-17           TRUE          FALSE
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
2

Looks like the columns hold boolean values, if it is not a problem converting the columns to boolean datatype then, following can work too (where .title() is used to change first character of string to uppercase e.g. FALSE to False, or true to True which can be used to convert then to corresponding boolean value):

df['ConfigredValue'] = df['ConfigredValue'].apply(lambda row: eval(row.title()))
df['ReferenceValue'] = df['ReferenceValue'].apply(lambda row: eval(row.title()))

Then, using same comparison as above:

df[df['ConfigredValue'] != df['ReferenceValue']]

Output:

    Last Known Date  ConfigredValue  ReferenceValue
2       26-Jun-17            True           False

Or, simply using title only similar to uppercase or lowercase:

df[df['ConfigredValue'].str.title() !=df['ReferenceValue'].str.title()]
niraj
  • 17,498
  • 4
  • 33
  • 48
1

Outside the box
pandas.read_csv reads all of these in as boolean. You can dump to csv and read it in again. Then you can use pd.DataFrame.query

pd.read_csv(pd.io.common.StringIO(df.to_csv(index=False))).query(
    'ConfigredValue != ReferenceValue')

  Last Known Date  ConfigredValue  ReferenceValue
2       26-Jun-17            True           False
piRSquared
  • 285,575
  • 57
  • 475
  • 624