
I have a Dask DataFrame with two columns, a date and a value.

I store it like so:

ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)

I'm expecting Dask to store the date column as a date in Parquet, but when I query it with Apache Drill I get 16-digit numbers (presumably timestamps) instead of dates. For example I get:

1546300800000000 instead of 2019-01-01

1548979200000000 instead of 2019-02-01

Is there a way to tell Dask to store columns as dates? How can I run a select with Apache Drill and get the dates? I tried using SELECT CAST in Drill but it doesn't convert the numbers to dates.
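
For reference, here is a minimal setup that reproduces this (column names are placeholders; I convert the column with pd.to_datetime before writing):

import pandas as pd
import dask.dataframe as dd

# placeholder data with one date column and one value column
df = pd.DataFrame({'date': ['2019-01-01', '2019-02-01'], 'value': [1, 2]})
df['date'] = pd.to_datetime(df['date'])

ddf = dd.from_pandas(df, npartitions=1)
ddf.to_parquet('/some/folder', engine='pyarrow', overwrite=True)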

ps0604

3 Answers


Not sure if this is relevant for you, but it seems that you are interested only in the date value (ignoring hours, minutes, etc.). If so, you can explicitly convert the timestamp information into date objects using .dt.date.

import pandas as pd
import dask.dataframe as dd

sample_dates = [
    '2019-01-01 00:01:00',
    '2019-01-02 05:04:02',
    '2019-01-02 15:04:02'
]

df = pd.DataFrame(zip(sample_dates, range(len(sample_dates))), columns=['datestring', 'value'])

ddf = dd.from_pandas(df, npartitions=2)

# convert to timestamp and calculate as unix time (relative to 1970)
ddf['unix_timestamp_seconds'] = (ddf['datestring'].astype('M8[s]') - pd.to_datetime('1970-01-01')).dt.total_seconds()

# convert to timestamp format and extract dates
ddf['datestring'] = ddf['datestring'].astype('M8[s]').dt.date

ddf.to_parquet('test.parquet', engine='pyarrow', write_index=False, coerce_timestamps='ms')

For time conversion, you can use .astype or dd.to_datetime (see answers to this question). There is also a very similar question and answer, which suggests that ensuring the timestamp is downcast to ms resolves the issue.
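
As an illustration, the same conversion with dd.to_datetime instead of .astype would look something like this (a sketch, applied to the original string column of the ddf above):

# equivalent conversion via dd.to_datetime rather than .astype('M8[s]')
ddf['datestring'] = dd.to_datetime(ddf['datestring']).dt.date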

Playing around with the values you provided, it's possible to see that the core problem is a mismatch in the scale of the values: the 16-digit numbers are microseconds since the Unix epoch:

# both yield: Timestamp('2019-01-01 00:00:00')

pd.to_datetime(1546300800000000*1000, unit='ns')
pd.to_datetime(1546300800000000/1000000, unit='s')
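
Or, reading the raw value directly as microseconds since the Unix epoch (which is what the 16-digit width suggests):

# also yields Timestamp('2019-01-01 00:00:00')
pd.to_datetime(1546300800000000, unit='us')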
SultanOrazbayev
  • Thanks, but the input is not a string, it's a date. Before storing the data frame in parquet I convert the column like this: `df[dateColumn] = pd.to_datetime(df[dateColumn])`, and when it's stored in parquet I still get the timestamp as an INT64. Is there a function in Drill to convert from INT64 to DATE? – ps0604 Apr 05 '21 at 10:38
  • Also, you might find this useful: https://stackoverflow.com/a/57508904/10693596 – SultanOrazbayev Apr 05 '21 at 10:53
  • If I add `unit='s'` I get the error `ValueError: unit='s' not valid with non-numerical val='2019-01-01'` – ps0604 Apr 05 '21 at 10:58
  • 1
    Thanks, the link that you posted had the answer, storing parquet with `coerce_timestamps='ms'` then running a select in Drill returns a formatted date – ps0604 Apr 05 '21 at 11:14

If memory serves, Drill uses the old non-standard INT96 timestamps, which were never part of the parquet standard. A parquet timestamp is essentially a UNIX timestamp, stored as an int64 with varying precision. Drill must have a function to correctly convert this to its internal format.
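
If it helps, one way to check which precision actually landed on disk is to inspect the schema with pyarrow (a sketch, using the folder from the question):

import pyarrow.parquet as pq

# the schema shows the stored timestamp precision,
# e.g. timestamp[us] (int64 microseconds) vs timestamp[ms]
print(pq.ParquetDataset('/some/folder').schema)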

I am no expert on Drill, but it seems you need to first divide your integer by the appropriate power of 10 (see this answer). This syntax is probably wrong, but it might give you the idea:

SELECT TO_TIMESTAMP(CAST(mycol AS FLOAT) / 1000000) FROM ...;
mdurant

Here's a link to the Drill docs about the TO_TIMESTAMP() function. (https://drill.apache.org/docs/data-type-conversion/#to_timestamp) I think @mdurant is correct in his approach.

I would try either:

SELECT TO_TIMESTAMP(<date_col>) FROM ...

or

SELECT TO_TIMESTAMP((<date_col> / 1000)) FROM ...
cgivre