
I am using Pandas DataFrames. I have a column from a CSV that contains integers mixed with nulls.

I am trying to convert this column and insert it into Spanner in as generalizable a way as possible (so I can reuse the same code for future jobs), which limits my ability to use sentinel values. However, DataFrames cannot hold NaNs in a plain int column, so you have to use the nullable Int64 dtype. When I try to insert this into Spanner I get an error that the value is not an INT64 type, whereas plain Python ints do work. Is there an automatic way to convert pandas Int64 values to plain int values during the insert? Converting the column before inserting doesn't work, again because of the null values. Is there another way around this?
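(Not part of the original question, but one generic way around this: convert each row to native Python values just before the insert, mapping any missing value to None, which Spanner stores as NULL. A minimal sketch; the column name is made up for illustration.)

```python
import numpy as np
import pandas as pd

def to_python_rows(df):
    """Convert a DataFrame to lists of native Python values,
    turning any missing value (NaN or pd.NA) into None and
    NumPy integers into plain Python ints."""
    return [
        [None if pd.isna(v)
         else int(v) if isinstance(v, np.integer)
         else v
         for v in row]
        for row in df.itertuples(index=False)
    ]

df = pd.DataFrame({"nullable": pd.array([3, None], dtype="Int64")})
print(to_python_rows(df))  # [[3], [None]]
```

The resulting lists can then be passed to Spanner's batch insert in place of `df.values.tolist()`.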

Trying to convert from a Series goes like so:

>>> s2 = pd.Series([3.0, 5.0])
>>> s2
0    3.0
1    5.0
dtype: float64
>>> s1 = pd.Series([3.0, None])
>>> s1
0    3.0
1    NaN
dtype: float64
>>> df = pd.DataFrame(data=[s1, s2], dtype=np.int64)
>>> df
   0    1
0  3  NaN
1  3  5.0
>>> df = pd.DataFrame(data={"nullable": s1, "nonnullable": s2}, dtype=np.int64)

This last command produces the error: ValueError: Cannot convert non-finite values (NA or inf) to integer
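(For context, my addition rather than the original post's code: the ValueError comes from forcing np.int64 onto data containing NaN. pandas' nullable extension dtype, spelled "Int64" with a capital I, accepts missing values, though its scalars are what Spanner later rejects.)

```python
import pandas as pd

# The nullable "Int64" extension dtype can hold missing values,
# unlike a plain np.int64 column:
s = pd.Series([3.0, None], dtype="Int64")
print(s.dtype)            # Int64
print(s.isna().tolist())  # [False, True]
```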

WarSame

2 Answers


I was unable to reproduce your issue; everything seems to work as expected.

Is it possible you have a non-nullable column that you are writing null values to?

Retrieving the schema of a Spanner table

from google.cloud import spanner

client = spanner.Client()
database = client.instance('testinstance').database('testdatabase')
table_name='inttable'

query = f'''
SELECT
t.column_name,
t.spanner_type,
t.is_nullable
FROM
information_schema.columns AS t
WHERE
t.table_name = '{table_name}'
'''

with database.snapshot() as snapshot:
    print(list(snapshot.execute_sql(query)))
    # [['nonnullable', 'INT64', 'NO'], ['nullable', 'INT64', 'YES']]

Inserting into Spanner from a Pandas DataFrame

from google.cloud import spanner

import numpy as np
import pandas as pd

client = spanner.Client()
instance = client.instance('testinstance')
database = instance.database('testdatabase')


def insert(df):
    with database.batch() as batch:
        batch.insert(
            table='inttable',
            columns=(
                'nonnullable', 'nullable'),
            values=df.values.tolist()
        )

print("Succeeds in inserting int rows.")
d = {'nonnullable': [1, 2], 'nullable': [3, 4]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Succeeds in inserting rows with None in nullable columns.")
d = {'nonnullable': [3, 4], 'nullable': [None, 6]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)

print("Fails (as expected) when attempting to insert a row with None in a non-nullable column.")
d = {'nonnullable': [5, None], 'nullable': [6, 0]}
df = pd.DataFrame(data=d, dtype=np.int64)
insert(df)
# Fails with "google.api_core.exceptions.FailedPrecondition: 400 nonnullable must not be NULL in table inttable."
Chris Wilcox
  • I ran into a few problems which I put into the main post. Could you help me understand why this is the case with a Series when doing it with an array seems to work so well? Do I have to use an array? – WarSame Mar 26 '19 at 18:16

My solution was to leave the values as NaN. Then, at the very end, just before inserting into the Spanner DB, I replaced all NaN values with None in the DataFrame, using code from another SO answer: df.replace({pd.np.nan: None}). Spanner was seeing the NaN as a 'nan' string and rejecting it for insertion into an INT64 column. None is treated as NULL and can be inserted into Spanner with no issue.
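(A minimal sketch of that replacement step, using np.nan directly since the pd.np alias has been removed from newer pandas releases.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nullable": [3.0, np.nan]})
# Replacing NaN with None upcasts the column to object dtype,
# so the values passed to Spanner become 3.0 and None (NULL).
cleaned = df.replace({np.nan: None})
print(cleaned.values.tolist())  # [[3.0], [None]]
```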

WarSame