I am trying to set up a DataFrameSchema in Pandera. The catch is that one of the columns of data may be a float or an int, depending on what data source was used to create the dataframe. Is there a way to set up a check on such a column? This code failed:

import pandera as pa
from pandera.typing import DataFrame, Series
from datetime import datetime
import pandas as pd

class IngestSchema(pa.SchemaModel):
    column_header: Series[float | int] = pa.Field(alias = 'MY HEADER')

Other things I've tried:

from typing import Union
float_int = Union[float, int]

But pandera does not recognize that union as a datatype. Is there any way to set up such a schema?

wdchild

1 Answer

Digging into the pandera docs, there is an is_numeric check that tests whether a dtype is a _Number datatype. It is a private variable at the moment, though, so it may only become usable some day down the line. In the meantime you can go with the suggested workaround:

from pandas.api.types import is_numeric_dtype
import pandera as pa
import pandas as pd

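# The check receives the whole column and tests its dtype, not each value.
is_number = pa.Check(is_numeric_dtype, name="is_number")
schema = pa.DataFrameSchema({"column": pa.Column(checks=is_number)})
# Fails validation: the string "a" makes the column dtype object.
schema(pd.DataFrame({"column": [1, 2, "a"]}))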
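
To see why this workaround covers both int and float sources, here is a minimal sketch that uses pandas only. The three Series and the results dict are made up for illustration; the only library call is is_numeric_dtype, which looks at the dtype of the whole column.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The same column as it might arrive from three different sources
# (illustrative data, not from the question).
int_col = pd.Series([1, 2, 3])           # dtype int64
float_col = pd.Series([1.0, 2.5, 3.0])   # dtype float64
mixed_col = pd.Series([1, 2, "a"])       # dtype object, because of the string

# is_numeric_dtype inspects the column's dtype, so int and float both qualify.
results = {
    "int": is_numeric_dtype(int_col),
    "float": is_numeric_dtype(float_col),
    "mixed": is_numeric_dtype(mixed_col),
}
print(results)
```

Because the test is on the dtype, a single stray string turns the whole column into object dtype and the check fails for the column as a whole rather than for one row.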

I see you're using SchemaModel, which I'm not very familiar with. I tested the following locally and it worked, though with the caveat that I'm unsure about the bare Series annotation:

import pandas as pd
import pandera as pa
from pandera.typing import Series
from pandas.api.types import is_numeric_dtype

class IngestSchema(pa.DataFrameModel):
    column_header: Series

    @pa.check("column_header")
    def check_is_number(cls, column_header: Series):
        return is_numeric_dtype(column_header)

# flags it
IngestSchema(pd.DataFrame({"column_header": [1, 2, "a"]}))

# passes
IngestSchema(pd.DataFrame({"column_header": [1, 2, 3]}))

Note that pa.DataFrameModel is the updated syntax and SchemaModel serves as an alias for it. SchemaModel will be deprecated in version 0.20.0 as mentioned in the docs.

neldeles