Need
I am trying to export a dataframe to a Parquet file, which will be consumed later in the pipeline by something that is not Python or Pandas (Azure Data Factory).
When I ingest the Parquet file later in the flow, it cannot recognize datetime64[ns]. I would rather just use "vanilla" Python datetime.datetime.
Problem
But I cannot manage to do this. The problem is that Pandas forces any datetime-like object into datetime64[ns] once it is back in a dataframe or series.
Small Example
For instance, assume the iris dataset with a "timestamp" column:
>>> df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)   class                   timestamp
0                5.1               3.5                1.4               0.2  setosa  2021-02-19 15:07:24.719272
1                4.9               3.0                1.4               0.2  setosa  2021-02-19 15:07:24.719272
2                4.7               3.2                1.3               0.2  setosa  2021-02-19 15:07:24.719272
3                4.6               3.1                1.5               0.2  setosa  2021-02-19 15:07:24.719272
4                5.0               3.6                1.4               0.2  setosa  2021-02-19 15:07:24.719272
>>> df.dtypes
sepal length (cm) float64
sepal width (cm) float64
petal length (cm) float64
petal width (cm) float64
class category
timestamp datetime64[ns]
dtype: object
I can convert a value to a "normal Python datetime":
>>> df.timestamp[1]
Timestamp('2021-02-19 15:07:24.719272')
>>> type(df.timestamp[1])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
>>> df.timestamp[1].to_pydatetime()
datetime.datetime(2021, 2, 19, 15, 7, 24, 719272)
>>> type(df.timestamp[1].to_pydatetime())
<class 'datetime.datetime'>
But I cannot "keep" it in that type when I convert the entire column / series:
>>> df['ts2'] = df.timestamp.apply(lambda x: x.to_pydatetime())
>>> df.dtypes
sepal length (cm) float64
sepal width (cm) float64
petal length (cm) float64
petal width (cm) float64
class category
timestamp datetime64[ns]
ts2 datetime64[ns]
Possible Solutions
I looked to see if there was anything I could do to "dumb down" the dataframe column and make its datetimes less precise, but I cannot see anything. Nor can I see an option to specify column data types upon export via the df.to_parquet() method.
Is there a way to create a plain Python datetime.datetime column (not the Numpy/Pandas datetime64[ns] column) in a Pandas dataframe?
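For reference, one approach that seems to keep the plain Python objects is to build the Series explicitly with dtype=object, so that Pandas has no chance to re-infer datetime64[ns]. This is a minimal sketch on a stand-in dataframe (the "ts2" name follows the example above), not a confirmed fix for the Parquet side.

```python
import datetime

import pandas as pd

# Stand-in for the real dataframe: only the timestamp column matters here.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2021-02-19 15:07:24.719272"] * 3)}
)

# Convert each pandas Timestamp to a plain datetime.datetime ...
py_datetimes = [ts.to_pydatetime() for ts in df["timestamp"]]

# ... and wrap them in a Series with dtype=object, so assignment does not
# trigger inference back to a datetime64 column.
df["ts2"] = pd.Series(py_datetimes, index=df.index, dtype=object)

print(df.dtypes)
print(type(df["ts2"].iloc[0]))
```

The catch is that an object column loses the vectorised .dt accessor, and a Parquet writer may still map datetime.datetime values to its own timestamp type on export, so this may not change what the downstream consumer sees.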