3

I'm using Pyarrow, Pyarrow.Parquet as well as Pandas. When I send a Pandas datetime64[ns] series to a Parquet file and load it again via a drill query, the query shows an Integer like: 1467331200000000 which seems to be something else than a UNIX timestamp.

The query looks like this:

SELECT workspace.id-column AS id-column, workspace.date-column AS date-column

When I open that file within Python again, it loads correctly and still has its datetime64[ns] type.

Any idea what's going wrong and how to solve this? I want this value being shown as a regular date.

Christian
  • 515
  • 1
  • 6
  • 17

2 Answers2

3

Ok, I found a solution some days ago which I would like to share. I think I initially missed something. It's very important to downcast to [ms] as well as allowing truncating timestamps before sending the dataframe to Parquet for becoming able to open it issue free in Drill:

pq.write_table(table, rf'{name}.parquet',
           coerce_timestamps='ms',
           allow_truncated_timestamps=True)

When I define a view in Drill I can cast that column as date or timestamp as required.

Christian
  • 515
  • 1
  • 6
  • 17
  • 1
    +1 and I will add that for a Pandas dataframe the shorthand `df.to_parquet('/some/path', coerce_timestamps='ms')` is sufficient. – Dzamo Norton Nov 11 '20 at 15:33
0

Could you please share parquet schema? Is it contains TIMESTAMP logical type for this column?

For now, you can write your custom UDF which converts BigInt nanoseconds value to timestamp, or use built-in functions (not sure about ns, but there are a lot of functions which accept millis):

select to_timestamp(1467331200000/1000);
+-----------------------+
|        EXPR$0         |
+-----------------------+
| 2016-07-01 03:00:00.0 |
+-----------------------+
Vova Vysotskyi
  • 661
  • 4
  • 8
  • Hey, in between I found a solution which I have to integrate in my `CREATE VIEW` statement: Basically I always want to create a view which contains all available columns, pure as they are, since they are already cleaned when written as Parquet. But some of these shall get a `CAST(xyz TO_TIMESTAMP)`. `SELECT *, CAST(xyz TO_TIMESTAMP) FROM workspace.file.paquet` doesn't work. No surprise. But do you have any idea how to handle it? – Christian Aug 05 '19 at 13:43
  • Please provide parquet schema to determine whether Drill should or shouldn't read this value as a timestamp. Regarding the query you have provided, it should work, but as it is expected, it will return the original column and converted column. – Vova Vysotskyi Aug 05 '19 at 17:43