Is it possible to validate a column based on another column using Pandera?
My dataframe looks like this:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Thomas", "", ""],
    "Address": ["Address 1", "Address 1", "Address 3"],
    "Zip": ["65989", "65989", "65954"],
    "External": [False, True, False],
})
I would like to validate the "Name" column based on the "External" column: if External is True, Name may be empty. In this example, the third record should be invalid because External is False and the name is missing.
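To make the rule concrete, here is a minimal sketch in plain pandas (not Pandera) of the result I am after. The names `name_invalid` and `invalid_rows` are only for illustration; the point is that only the Name value of the third record gets flagged.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Thomas", "", ""],
    "Address": ["Address 1", "Address 1", "Address 3"],
    "Zip": ["65989", "65989", "65954"],
    "External": [False, True, False],
})

# A name is invalid only when the row is not external and the name is empty.
name_invalid = ~df["External"] & (df["Name"].str.len() == 0)

# Index labels of the rows whose Name should fail validation.
invalid_rows = df.index[name_invalid].tolist()
print(invalid_rows)  # [2]
```

What I am looking for is a way to express the same row-by-row condition inside a Pandera schema.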
A related question here suggests using wide checks. However, that way every column of the third record (Name, Address, Zip and External) is reported as invalid, whereas I need only Name to be flagged and the rest ignored.
import pandera as pa
from pandera import extensions

# Register the custom check before the schema references it.
@extensions.register_check_method(check_type="element_wise")
def is_external(pandas_obj: pd.Series):
    # A row is invalid when it is not external and its name is empty.
    if (not pandas_obj["External"]) and len(pandas_obj["Name"]) < 1:
        return False
    return True

schema_ = pa.DataFrameSchema(
    {
        "Name": pa.Column(str),
        "Address": pa.Column(str),
        "Zip": pa.Column(str),
        "External": pa.Column(bool),
    },
    checks=[pa.Check.is_external()],
)
I also tried something like this in the schema:
    "Name": pa.Column(str, checks=[
        pa.Check.is_external(df["External"]),
    ]),
But in this case the whole External column is passed to the function ([False, True, False]), and I am not sure how to compare those values to the corresponding values of the Name column.
Is it possible to do this kind of check in Pandera? Thanks in advance!