
I have an issue with pandas, pyarrow and BigSQL. Basically, I want to read a CSV file, write it to Parquet and read the Parquet file from the database. But there is a problem when reading the file from the database: the pandas dtype and the database data type are not the same.

There is a table created in BigSQL like this:

CREATE EXTERNAL hadoop TABLE sch.test (
  id bigint ,
  integer_column integer
)

I have a CSV file and I'm loading it into pandas:

import pandas as pd

csv = pd.read_csv(x)  # x is the path to the CSV file
print(csv.head(20))

           id          integer_column 
0      200001                     0.0
1      200002                     0.0
2      200003                     0.0
3      200004                     0.0
4      200005                     0.0
5      200006                     0.0
6      200007                     0.0
7      200008                     0.0
8      200009                     0.0
9      200010                     0.0
10     200011                     0.0
11     200012                     0.0
12     200013                     0.0
13     200014                     0.0
14     200015                     0.0
15     200016                     NaN
16     200017                     NaN
17     200018                     NaN
18     200019                     NaN
19     200020                     NaN

The raw CSV looks like this:

id,integer_column 
200001,0.0
200002,0.0
200016,

Then I'm trying to write a Parquet file from that DataFrame through pyarrow.parquet.ParquetWriter:

    import pyarrow as pa
    import pyarrow.parquet as pq

    csv = pd.read_csv(x)
    table = pa.Table.from_pandas(csv)
    pqwriter = pq.ParquetWriter('./filename.prq', table.schema, coerce_timestamps='ms',
                                allow_truncated_timestamps=True, flavor='spark')
    pqwriter.write_table(table)
    pqwriter.close()
    print(pa.Schema.from_pandas(csv))
id: int64
integer_column : double
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 25000, "step": 1}], "column_indexes": [{"name": null, "fi'
            b'eld_name": null, "pandas_type": "unicode", "numpy_type": "object'
            b'", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "id
            b, "field_name": "id", "pandas_type": "int64", "nu'
            b'mpy_type": "int64", "metadata": null}, {"name": "integer_column '
            b'medelta", "field_name": "integer_column", "pandas_type":'
            b' "float64", "numpy_type": "float64", "metadata": null}], "creato'
            b'r": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version'
            b'": "0.25.3"}'}

So I want to cast integer_column as an integer, get rid of the floating point, and then write it into a Parquet file that Hive/BigSQL will be able to read. When I insert it as shown, there is an error:

bigsql: The statement failed because a Big SQL component encountered an error. Component receiving the error: "BigSQL IO". Component returning the error: "UNKNOWN". SQLCODE=-5105, SQLSTATE=58040

hive: SQL Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.IntWritable

I can't force pandas or pyarrow to cast the double to an integer properly. I know it is a double because there are null values and that's pandas' natural behaviour, but how can I insert it properly into the db as an integer with null values?

The only working solutions are to alter the db column to a double type or to do fillna(0).astype(np.int32) on this pandas column, but that way I lose information.

If you have any idea, please share your knowledge. Thanks.

Resolved: a pandas/pyarrow upgrade and the code below helped.

csv['integer_column'] = csv['integer_column'].astype('Int64')
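
For context, a minimal end-to-end sketch of the whole flow with the nullable dtype (x is the CSV path as above, the output filename is the one from the question; assumes a reasonably recent pandas/pyarrow, per the upgrade mentioned):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Read the CSV and convert the float column to pandas' nullable integer dtype;
    # NaN becomes <NA> and the values stay integers.
    csv = pd.read_csv(x)
    csv['integer_column'] = csv['integer_column'].astype('Int64')

    # The Arrow schema now reports an integer type instead of double.
    table = pa.Table.from_pandas(csv)
    print(table.schema)

    # Write the Parquet file as before; the column is stored as a nullable integer.
    pqwriter = pq.ParquetWriter('./filename.prq', table.schema, flavor='spark')
    pqwriter.write_table(table)
    pqwriter.close()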

Chan

1 Answer


Pandas 0.24.0 and greater have support for nullable integer columns, so you can convert in pandas:

csv['integer_column'] = csv['integer_column'].astype('Int64')

Or you can use pyarrow compute functions to cast from float to int in pyarrow:

import pyarrow as pa
import pyarrow.compute as pc

table = table.set_column(1, table.column_names[1], pc.cast(table.column(1), pa.int64()))
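
The integer width should match the target column type: int64 corresponds to a bigint column, while the `integer` column in the DDL above expects IntWritable on the Hive side, so casting to pa.int32() may be needed there. A hedged sketch, reusing the same table:

    import pyarrow as pa
    import pyarrow.compute as pc

    # Cast column 1 (integer_column) to a 32-bit integer; nulls are preserved.
    # pa.int32() lines up with a Hive/BigSQL `integer` column, pa.int64() with `bigint`.
    table = table.set_column(1, table.column_names[1],
                             pc.cast(table.column(1), pa.int32()))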
Pace
  • Basically it helped, but only for BigSQL. Hiveserver still can't load this value. Error: SQL Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable. This column's max value is `print(selected['integer_column'].max())` `4534`. I don't know why Hive can't load this as an integer. When I change the column datatype to bigint it all goes away, of course. – Chan Mar 08 '21 at 12:28