I have an issue with pandas, pyarrow and BigSQL. Basically I want to read a CSV file, write it to Parquet, and read the Parquet file back from the DB. But there is a problem when reading the file from the DB: the pandas dtype and the DB data type are not the same.
There is a table created in BigSQL like:
CREATE EXTERNAL hadoop TABLE sch.test (
id bigint ,
integer_column integer
)
I have a CSV file and I load it into pandas:
import pandas as pd

csv = pd.read_csv(x)  # x is the path to the CSV file
print(csv.head(20))
id integer_column
0 200001 0.0
1 200002 0.0
2 200003 0.0
3 200004 0.0
4 200005 0.0
5 200006 0.0
6 200007 0.0
7 200008 0.0
8 200009 0.0
9 200010 0.0
10 200011 0.0
11 200012 0.0
12 200013 0.0
13 200014 0.0
14 200015 0.0
15 200016 NaN
16 200017 NaN
17 200018 NaN
18 200019 NaN
19 200020 NaN
The raw CSV looks like this:
id,integer_column
200001,0.0
200002,0.0
200016,
Then I'm trying to write a Parquet file from that through pyarrow.parquet.ParquetWriter, like this:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv = pd.read_csv(x)
table = pa.Table.from_pandas(csv)
pqwriter = pq.ParquetWriter('./filename.prq', table.schema,
                            coerce_timestamps='ms',
                            allow_truncated_timestamps=True, flavor='spark')
pqwriter.write_table(table)
pqwriter.close()
print(pa.Schema.from_pandas(csv))
id: int64
integer_column: double
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 25000, "step": 1}], "column_indexes": [{"name": null, "fi'
b'eld_name": null, "pandas_type": "unicode", "numpy_type": "object'
b'", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "id
b, "field_name": "id", "pandas_type": "int64", "nu'
b'mpy_type": "int64", "metadata": null}, {"name": "integer_column '
b'medelta", "field_name": "integer_column", "pandas_type":'
b' "float64", "numpy_type": "float64", "metadata": null}], "creato'
b'r": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version'
b'": "0.25.3"}'}
So I want to cast integer_column as an integer, get rid of the floating point, and then write it into a Parquet file that Hive/BigSQL will be able to read. When I insert it the way I showed, there is an error:
BigSQL: The statement failed because a Big SQL component encountered an error. Component receiving the error: "BigSQL IO". Component returning the error: "UNKNOWN". SQLCODE=-5105, SQLSTATE=58040
Hive: SQL Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.IntWritable
I can't force pandas or pyarrow to cast the double to an integer properly. I know it is a double because there are null values and that is pandas' natural behaviour, but how can I insert it into the DB properly as an integer with null values?
The only working solutions are to alter the DB column to a double type, or to apply fillna(0).astype(np.int32) to this pandas column (spelled out below), but then I lose information.
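For clarity, this is what the lossy workaround looks like (a minimal sketch; x is the CSV path as above, and np is NumPy):

import numpy as np
import pandas as pd

csv = pd.read_csv(x)
# NaN forces pandas to keep integer_column as float64; filling the
# missing values with 0 makes the cast to int32 possible, but the
# original NULLs are now indistinguishable from real zeros.
csv['integer_column'] = csv['integer_column'].fillna(0).astype(np.int32)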
If you have any idea, please share your knowledge. Thanks.
Resolution: a pandas/pyarrow upgrade and the code below helped.
csv['integer_column'] = csv['integer_column'].astype('Int64')
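For completeness, the whole write path with the nullable dtype looks roughly like this (a sketch, assuming a recent pandas/pyarrow, roughly pandas >= 1.0 and pyarrow >= 0.16, where the nullable Int64 dtype converts to an Arrow int64 column with proper nulls; x and ./filename.prq are the same placeholders as above):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv = pd.read_csv(x)
# Nullable integer dtype: missing values stay as <NA> instead of
# forcing the column to float64.
csv['integer_column'] = csv['integer_column'].astype('Int64')

table = pa.Table.from_pandas(csv)
print(table.schema)  # integer_column is now int64, nulls preserved

pqwriter = pq.ParquetWriter('./filename.prq', table.schema,
                            coerce_timestamps='ms',
                            allow_truncated_timestamps=True, flavor='spark')
pqwriter.write_table(table)
pqwriter.close()

If BigSQL still complained about int64 versus the 32-bit integer column, casting the Arrow column down to pa.int32() would be the next thing to try; in this case the Int64 cast alone was enough.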