
I have created a Pandera validation schema for a Pandas dataframe with ~150 columns, like the first two rows in the schema below. The single-column validation works, but how can I combine two or more columns for validation? I found two related questions here and here, but I still can't build a valid schema.

import pandas as pd
import numpy as np
import pandera as pa

df = pd.DataFrame({'preg': [1, 0, 0, np.nan], 'nr_preg': [2, np.nan, 1, np.nan]})

schema = pa.DataFrameSchema({
    'preg': pa.Column(float, pa.Check.isin([1, 0]), nullable=True),
    'nr_preg': pa.Column(float, pa.Check.in_range(1, 10), nullable=True),
    # ...
    # not working:
    # if preg=0 -> nr_preg must be NaN
    'preg': pa.Column(float, pa.Check(lambda s: s['preg'] == 0 & s['nr_preg'].isnull() == False), nullable=True)
})

UPDATE
I now have this solution:

df = pd.DataFrame({'preg': [1, 0, 0], 'nr_preg': [2, np.nan, 1], 'x': [1, 2, 3], 'y': [1, 2, 3]})
schema = pa.DataFrameSchema(
    # single columns checks
    columns={
        'preg': pa.Column(int, pa.Check.isin([1, 0]), nullable=True),
        'nr_preg': pa.Column(float, pa.Check.in_range(1, 10), nullable=True),
    },
    # combined column checks
    checks=[
        pa.Check(lambda df: ~((df['preg'].isin([np.nan, 0])) & (
            df['nr_preg'] > 0)), ignore_na=False, error="Error_A")
    ])
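To sanity-check the combined check, its boolean expression can be evaluated directly with plain pandas (True means the row passes); a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'preg': [1, 0, 0], 'nr_preg': [2, np.nan, 1]})

# Same expression as in the DataFrame-level Check above:
# a row fails when preg is 0 (or NaN) while nr_preg holds a value.
passes = ~((df['preg'].isin([np.nan, 0])) & (df['nr_preg'] > 0))
print(passes.tolist())  # → [True, True, False]: only the row at index 2 fails
```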

However, the failure cases also list the columns x and y, which are not part of the check and which I am not interested in; Error_A does not apply to them. How can I remove them from the result?

    schema_context   column    check  check_number  failure_case  index
0  DataFrameSchema     preg  Error_A             0           0.0      2
1  DataFrameSchema  nr_preg  Error_A             0           1.0      2
2  DataFrameSchema        x  Error_A             0           3.0      2
3  DataFrameSchema        y  Error_A             0           3.0      2
wl_

1 Answer


This appears to be expected behavior. A workaround appears in this GitHub issue here.

Effectively, you need to group the unique errors by their index and select the check column. This isn't ideal, though, because you lose the column information needed to track which data failed. Making your error messages more specific can help with that.

The good news is that this appears to be actively worked on, though there is no ETA.

Workaround Example:

import pandas as pd
import pandera as pa

def df_validate(df: pd.DataFrame, schema: pa.DataFrameSchema) -> None:
    try:
        # lazy=True collects all failures instead of raising on the first one
        schema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as schema_errors:
        print("Schema errors and failure cases:")
        print(schema_errors.failure_cases.groupby('index')["check"].unique())
        # Process your errors
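If only the columns that actually participate in the check matter, the failure cases can also be filtered by column before reporting. A minimal pure-pandas sketch, assuming `failure_cases` carries the `column` and `check` fields shown in the question's output (the frame below is hand-built to mirror it):

```python
import pandas as pd

# Hand-built stand-in for schema_errors.failure_cases from the question.
failure_cases = pd.DataFrame({
    'schema_context': ['DataFrameSchema'] * 4,
    'column': ['preg', 'nr_preg', 'x', 'y'],
    'check': ['Error_A'] * 4,
    'index': [2, 2, 2, 2],
})

# Keep only the columns the wide check actually references.
checked_cols = ['preg', 'nr_preg']
relevant = failure_cases[failure_cases['column'].isin(checked_cols)]
print(relevant['column'].tolist())  # → ['preg', 'nr_preg']
```

The list of checked columns has to be maintained by hand here, since a DataFrame-level check reports a failure case for every column in the frame.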


TYPKRFT