Ingesting An Null Int Column: Pandas and Pandera

Question

I am using pandas with pandera for schema validation, but I've run into a problem since there's a null integer column in the data.

from prefect import task, Flow          #type:ignore
from pandera import Check, Column, DataFrameSchema
import prefect
import pandas as pd
import pandera as pa
import numpy as np


def pschema(d):
    logger = prefect.utilities.logging.get_logger()  # type: ignore
    engine = connect_db(prefect.config.kv.p.staging_db_constring, logger) #type:ignore


    table_name = "MyTable"
    org = "myOrg"

    k = {}
    df = pd.read_sql(
    f"SELECT NameStrNotQuoted, FieldTypeName, SizeStr, Precision, Scale FROM dbo.vw_cx_meta WHERE [Table] = '{table_name}' and Organization='{org}' AND ETL_Active = 1",
    engine,
    )
    for row in df.itertuples(index=False):
        if row.FieldTypeName == "int":
            k.update({row.NameStrNotQuoted:Column(int,Check(lambda x: pd.Series([x.fillna(0)],dtype='Int64')),coerce=True, nullable=True)})
        elif row.FieldTypeName == 'bit':
            k.update({row.NameStrNotQuoted:Column(pa.Bool, coerce=True)})
    sch = DataFrameSchema(k)

    sch.validate(d)
    return k

The errors:

ValueError: cannot convert float NaN to integer
.
.
.
  File "/usr/local/lib/python3.8/site-packages/pandera/schemas.py", line 1789, in coerce_dtype
    raise errors.SchemaError(
pandera.errors.SchemaError: Error while coercing 'CopySourceID' to type int64: Could not coerce <class 'pandas.core.series.Series'> data_container into type int64:

I am aware of pandas 'gotcha' with nulls in int columns and I've tried every permutation of using Check's lambda function to solve this. Any help would be appreciated, thanks.

I believe this is not a Pandera problem, but just a limitation of casting a column of floats with nulls to type Int. This is simply not possible, as I believe you already mentioned in your last comment of your post. You could put coerce=False, but the column will remain float of course. — flow_me_over, Apr 13 '22 at 07:54

score 6 · Answer 1 · answered May 05 '22 at 11:02

6

I just came across this, it's solvable by combination of Pandas nullable integer ("Int64" / https://pandas.pydata.org/docs/user_guide/integer_na.html) and coerce=True.

eg:

pa.Column("Int64",coerce=True)

answered May 05 '22 at 11:02

Michal

128
1
7

Had similar problem as in description of question when reading data from excel file and validating using `pandera.DataFrameSchema`. This answer helped. – devaerial Dec 12 '22 at 11:17

Ingesting An Null Int Column: Pandas and Pandera

1 Answers1