How do I validate a value in a dataframe which is dependent on other value in that specific row?

Question

Suppose I have a .csv which follows this format:

Name, Salary, Department, Mandatory

Rob, 5500, Aviation, Yes

Bob, 1000, Facilities, No

Tom, 6000, IT, Yes

After exporting this to pandas/modin, I'd like to perform row-differentiated checks, where:

People named Rob working in aviation cannot earn less than 5000
People named Bob working in facilities cannot earn less than 1000
Whoever works in facilities has to report their salary, while people working in aviation or IT can choose to leave their salary unreported.
If any check is violated, we store this in a dataframe and pass forward this case to the human resources department for further investigation.

How would you validate this .csv using Pandera?

score 1 · Accepted Answer · answered Dec 21 '21 at 01:54

Depending on which API you're using, you can check out the wide checks for the object-based API or dataframe checks for the class-based API.

Note: the code snippets below aren't tested, but should be going in the right direction

Class-based API:

import pandera as pa
from pandera.typing as Series

class Schema(pa.SchemaModel):
    Name: Series[str]
    Salary: Series[int]
    Department: Series[str]
    Mandatory: Series[str]
    
    @pa.dataframe_check
    def rob_aviation_check(cls, df) -> Series[bool]:
        return df.loc[df["Name"] == "Rob" & df["Department"] == "Aviation", "Salary"] >= 5000

Object-based API:

schema = DataFrameSchema(
    columns={
        "Name": pa.Column(str),
        "Salary": Pa.Column(int),
        ...
    }
    checks=[
        pa.Check(lambda df: df.loc[df["Name"] == "Rob" & df["Department"] == "Aviation", "Salary"] >= 5000)
    ]
)

How do I validate a value in a dataframe which is dependent on other value in that specific row?

1 Answers1

Linked