0

The following code is converting any kind of timestamp of dataframe into a given Format.

pd.to_datetime(df_pd["timestamp"]).dt.strftime('%Y-%m-%d %X')

How can I do this with "DASK"? I used the below code but it did not work.

(df is dask dataframe)

a=dd.to_datetime(df["time:timestamp"],format='%Y-%m-%d %X')
a.compute()

Error-: ValueError: unconverted data remains: .304000+00:00

this is how timestamp look like-: "2016-01-01 09:51:15.304000+00:00" (This could be any kind of format)

Expected output -: "2016-01-01 09:51:15"

I found Converting a Dask column into new Dask column of type datetime, but it is not working

Example with Pandas which works with any format-:


import pandas as pd
  

data = ['2016-01-01 09:51:15.304000+00:00','2016-01-01 09:51:15.304000+00:00','2016-01-01 09:51:15.304000+00:00','2016-01-01 09:51:15.304000+00:00']
data1 = ['2016-01-01 09:51:15','2016-01-01 09:51:15','2016-01-01 09:51:15','2016-01-01 09:51:15','2016-01-01 09:51:15']
data2 = ['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01']
  

df1 = pd.DataFrame(data2, columns=['t'])

df1['t']=pd.to_datetime(df1["t"]).dt.strftime('%Y-%m-%d %X')

Can someone tell me, how to do the same with "Dask"

Here is my solution

it could be done with following code-: dd.to_datetime(df["t"].compute()).dt.strftime('%Y-%m-%d %X')

but now the problem is that i can't store this conversion in the existing dataframe like i did with pandas.

if i do df["t"]=dd.to_datetime(df["t"].compute()).dt.strftime('%Y-%m-%d %X'), it throws an error.

ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

this ValueError: Not all divisions are known, can't align partitions error on dask dataframe does not workk

Coder
  • 1,129
  • 10
  • 24

3 Answers3

1

As you already have the string in the almost correct format, maybe just with with the strings:

df_pd['timestamp'] = df_pd['timestamp'].str.replace(r'\..*', '', regex=True)

Alternatively, if you need to use to_datetime:

pd.to_datetime(df_pd["timestamp"]).dt.strftime('%Y-%m-%d %X')

Or:

pd.to_datetime(df_pd["timestamp"],format='%Y-%m-%d %H:%M:%S.%f%z').dt.strftime('%Y-%m-%d %X')
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thanks for the Answer. I think you did it with the "Pandas", But i want to do it with Dask. one solution is mentioned below, could you please take a look at answer and also the comment that i added. – Coder Jul 08 '22 at 13:10
  • I update the question as well, kindly take a look – Coder Jul 08 '22 at 14:24
1

You can truncate the datetime:

# Solution 1
>>> dd.to_datetime(df['time:timestamp'].str[:19]).compute()
0   2016-01-01 09:51:15
dtype: datetime64[ns]


# Solution 2
>>> dd.to_datetime(df['time:timestamp'].str.split('.').str[0]).compute()
0   2016-01-01 09:51:15
dtype: datetime64[ns]


# Solution 3 (@mozway)
>>> dd.to_datetime(df['time:timestamp'].str.replace('\..*', '', regex=True)).compute()
0   2016-01-01 09:51:15
dtype: datetime64[ns]
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • Thanks for the answer. But does it work for all kinds of timestamps? The one I mentioned with "pandas" can accept any kind of format and convert it into the "Expected timestamp" that I added there. – Coder Jul 08 '22 at 13:08
  • I update the question as well, kindly take a look – Coder Jul 08 '22 at 14:24
0

Here is how i did it

df["time:timestamp"]=dd.to_datetime(df["time:timestamp"]).dt.strftime('%Y-%m-%d %X')

df.compute()```
Coder
  • 1,129
  • 10
  • 24