2

Need

I am trying to export a dataframe to a Parquet file, which will be consumed later in the pipeline by something that is not Python or Pandas. (Azure Data Factory)

When I ingest the Parquet file later in the flow, it cannot recognize datetime64[ns]. I would rather just use "vanilla" Python datetime.datetime.

Problem

But I cannot manage to do this. The problem is that Pandas is forcing any "datetime-like object into datetime64[ns] once it is back in a dataframe or series.

Small Example

For instance, assume the iris dataset with a "timestamp" column:

>>> df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)   class                  timestamp
0                5.1               3.5                1.4               0.2  setosa 2021-02-19 15:07:24.719272
1                4.9               3.0                1.4               0.2  setosa 2021-02-19 15:07:24.719272
2                4.7               3.2                1.3               0.2  setosa 2021-02-19 15:07:24.719272
3                4.6               3.1                1.5               0.2  setosa 2021-02-19 15:07:24.719272
4                5.0               3.6                1.4               0.2  setosa 2021-02-19 15:07:24.719272

>>> df.dtypes
sepal length (cm)           float64
sepal width (cm)            float64
petal length (cm)           float64
petal width (cm)            float64
class                      category
timestamp            datetime64[ns]
dtype: object

I can convert a value to a "normal Python datetime":

>>> df.timestamp[1]
Timestamp('2021-02-19 15:07:24.719272')
>>> type(df.timestamp[1])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

>>> df.timestamp[1].to_pydatetime()
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
>>> type(df.timestamp[1].to_pydatetime())
<class 'datetime.datetime'>

But I cannot "keep" it in that type, when I convert the entire column / series:

>>> df['ts2'] = df.timestamp.apply(lambda x: x.to_pydatetime())
>>> df.dtypes
sepal length (cm)           float64
sepal width (cm)            float64
petal length (cm)           float64
petal width (cm)            float64
class                      category
timestamp            datetime64[ns]
ts2                  datetime64[ns]

Possible Solutions

I looked to see if there were anything I could do to "dumb down" the dataframe column and make its datetimes less precise. But I cannot see anything. Nor can I see an option to specify column data types upon export via the df.to_parquet() method.

Is there a way to create a plain Python datetime.datetime column (not the Numpy/Pandas datetime65[ns] column) in a Pandas dataframe?

Mike Williamson
  • 4,915
  • 14
  • 67
  • 104
  • or try meddling with classes and inheritance – Luca Sans S Feb 19 '21 at 14:38
  • look up the super() function in python – Luca Sans S Feb 19 '21 at 14:38
  • Just curious as why you would want Python's datetime inside a dataframe as it doesn't play very nice with Pandas, turning the series into object. – Quang Hoang Feb 19 '21 at 14:39
  • 1
    while you're working in `pandas`, you'll want to keep datetime64 dtype. if I remember correctly, I've seen a similar question here before and I don't think there was an option to *force* a datetime.datetime column to be exported to parquet. Maybe using an ISO format string is an option? Of course, the tools you use on the other side will have to parse to some date/time dtype again I guess... – FObersteiner Feb 19 '21 at 14:49
  • 1
    @QuangHoang, I need the `datetime` simply because few tools outside of numpy/pandas -- and specifically my tool, Azure Data Factory -- understand the `datetime64[ns]` format. While still within Python, sure, I like that format and I am fine with it. But this script is just part of an ELT chain. – Mike Williamson Feb 22 '21 at 10:31

1 Answers1

2

Try to force the dtype='object' when you use to_pydatetime:

df['ts'] = pd.Series(df.timestamp.dt.to_pydatetime(),dtype='object')

df.loc[0,'ts']

Output:

datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Hi Quang. Thanks for the answer! This did not *precisely* solve my problem. As I showed above: I *can* get the series converted to datetime, but then the dataframe converts the *column* back to `datetime64[ns]` automatically. But you got me thinking of the eventual solution: just force it to a string in the standard Asian-style date format. That worked. :) – Mike Williamson Feb 22 '21 at 10:34
  • @MikeWilliamson as shown in the output. The **series** with object dtype preserves datetime timestamp, even when you concatenation it back to the dataFrame ( as the `ts` column). I did not try to override original column, but I’m pretty sure it should behave the same. – Quang Hoang Feb 22 '21 at 10:37