0

What's the best way to validate that the values in a column in a data frame match those in a given list (and, if not, are set to a value accordingly)?

I'm trying to replace invalid values for a field ("B") with the value "(not set)".

Here's what I've tried. The goal is to replace 'f' in 'B' with "(not set)". The trouble is validating against a list of values rather than a single boolean expression.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': ['a', 'b', 'f', 'a', 'b', 'c']})
valid_values = ['a', 'b', 'c', 'd']
validated_data = df.'B' not in [valid_values]
column_name = 'B'
df.loc[validated_data, column_name] = "(not set)"

The above taken from here: Conditional Replace Pandas

I've also tried:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': ['a', 'b', 'f', 'a', 'b', 'c']})
valid_values = ['a', 'b', 'c', 'd']

for x in df.B:
    if x not in valid_values:
        df.B = "(not set)"
...

However, for both I get the error "The truth value of a Series is ambiguous." I tried the recommendations at the following post to no avail.

Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Other recommendations greatly appreciated!

le_crease
  • 69
  • 1
  • 9
  • You will get more and better answers if you create a [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example. Especially make sure that the input and expected data are complete (not pseudo-data), and can be easily cut and and paste into an editor to allow testing proposed solutions. – Stephen Rauch Jan 03 '18 at 03:42
  • 1
    Thank you. Error's answer below does this, but I will also edit the above for future passers by. – le_crease Jan 03 '18 at 17:54

1 Answers1

1

You need to use the isin method of Series and DataFrame objects.

Consider this example:

df = pd.DataFrame({'A': [1, 2, 3,4,5,6], 'B': ['a', 'b', 'f','a','b','c']})

Resulting in this DataFrame:

    A   B
0   1   a
1   2   b
2   3   f
3   4   a
4   5   b
5   6   c

Let's say we want to remove invalid values in column B and those invallid values are a en b.

When we use the isin method we can run this code:

mask = df['B'].isin(['a','b'])
df.loc[mask,'B'] = 'not set'

Resulting in this DataFrame:

    A   B
0   1   not set
1   2   not set
2   3   f
3   4   not set
4   5   not set
5   6   c

The isin method checks if the values in a Series or DataFrame object are in the given iterable. See the docs for more information.

error
  • 2,356
  • 3
  • 23
  • 25
  • Great, thank you very much. I adjusted `mask = df['B'].isin(['a','b'])` to be `mask = ~df['B'].isin(['a','b'])` so that it is checking if a value is NOT in the given list. – le_crease Jan 03 '18 at 17:50