I have the following localize datetime series. When I am trying to save the date/time as second some datetimes getting saved as -1 instead of actual date. Interestingly this happens when time saving is happening, does anyone know how to fix this? here is the code.
test_time_df
here is the output of of test_time_df
Out[4]: Datetime
2022-03-13 01:40:17.999000-06:00 2022-03-13 01:40:17.999000-06:00
2022-03-13 01:45:19-06:00 2022-03-13 01:45:19-06:00
2022-03-13 01:50:18-06:00 2022-03-13 01:50:18-06:00
2022-03-13 01:55:18.999000-06:00 2022-03-13 01:55:18.999000-06:00
2022-03-13 03:00:28-05:00 2022-03-13 03:00:28-05:00
2022-03-13 03:05:20-05:00 2022-03-13 03:05:20-05:00
2022-03-13 03:10:17.999000-05:00 2022-03-13 03:10:17.999000-05:00
2022-03-13 03:15:18-05:00 2022-03-13 03:15:18-05:00
2022-03-13 03:20:20-05:00 2022-03-13 03:20:20-05:00
2022-03-13 03:25:18.999000-05:00 2022-03-13 03:25:18.999000-05:00
Name: Datetime, dtype: datetime64[ns, America/Chicago]
when I try to save it to csv thats what I get
test_time_df.to_csv('test_01.csv', date_format='%s')
when I try to read the data with pandas or in linux. I get the following results.
Datetime,Datetime
1647135617,1647135617
1647135919,1647135919
1647136218,1647136218
1647136518,1647136518
-1,-1
-1,-1
-1,-1
-1,-1
-1,-1
-1,-1
By the way I am using Databricks with 'pandas == 1.4' and numpy==1.20.1.
Here is additional information: the original dataframe is df_test with following input:
import datetime
df_test = pd.DataFrame({"Batch ID":[6409713, 6409715, 6409719, 6409721, 6409723, 6409731, 6409733,
6409735, 6409739, 6409741]," SCED Timestamp ":[datetime.datetime(2022, 3, 13, 1, 40, 17, 999000),
datetime.datetime(2022, 3, 13, 1, 45, 19),
datetime.datetime(2022, 3, 13, 1, 50, 18),
datetime.datetime(2022, 3, 13, 1, 55, 18, 999000),
datetime.datetime(2022, 3, 13, 3, 0, 28),
datetime.datetime(2022, 3, 13, 3, 5, 20),
datetime.datetime(2022, 3, 13, 3, 10, 17, 999000),
datetime.datetime(2022, 3, 13, 3, 15, 18),
datetime.datetime(2022, 3, 13, 3, 20, 20),
datetime.datetime(2022, 3, 13, 3, 25, 18, 999000)],
"Repeated Hour Flag": ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N'],
"System Lamda": [27.3663, 26.2636, 28.441, 27.1946, 19.4144, 26.5737, 30.4865,
29.5245, 29.6784, 28.6156] })
df_test
Batch ID SCED Timestamp Repeated Hour Flag System Lamda
20695 6409713 2022-03-13 01:40:17.999000 N 27.3663
20696 6409715 2022-03-13 01:45:19 N 26.2636
20697 6409719 2022-03-13 01:50:18 N 28.441
20698 6409721 2022-03-13 01:55:18.999000 N 27.1946
20699 6409723 2022-03-13 03:00:28 N 19.4144
20700 6409731 2022-03-13 03:05:20 N 26.5737
20701 6409733 2022-03-13 03:10:17.999000 N 30.4865
20702 6409735 2022-03-13 03:15:18 N 29.5245
20703 6409739 2022-03-13 03:20:20 N 29.6784
20704 6409741 2022-03-13 03:25:18.999000 N 28.6156
Then I perform the following transformation:
datetimes = pd.to_datetime(df_test[' SCED Timestamp '])
datetimes = pd.DatetimeIndex(datetimes)
datetimes = datetimes.tz_localize('America/Chicago', ambiguous=(df_test['Repeated Hour Flag'] == 'N'))
datetimes.name = "Datetime"
df_test = df_test.set_index([datetimes])
df_test
Batch ID SCED Timestamp Repeated Hour Flag System Lamda
Datetime
2022-03-13 01:40:17.999000-06:00 6409713 2022-03-13 01:40:17.999000 N 27.3663
2022-03-13 01:45:19-06:00 6409715 2022-03-13 01:45:19 N 26.2636
2022-03-13 01:50:18-06:00 6409719 2022-03-13 01:50:18 N 28.441
2022-03-13 01:55:18.999000-06:00 6409721 2022-03-13 01:55:18.999000 N 27.1946
2022-03-13 03:00:28-05:00 6409723 2022-03-13 03:00:28 N 19.4144
2022-03-13 03:05:20-05:00 6409731 2022-03-13 03:05:20 N 26.5737
2022-03-13 03:10:17.999000-05:00 6409733 2022-03-13 03:10:17.999000 N 30.4865
2022-03-13 03:15:18-05:00 6409735 2022-03-13 03:15:18 N 29.5245
2022-03-13 03:20:20-05:00 6409739 2022-03-13 03:20:20 N 29.6784
2022-03-13 03:25:18.999000-05:00 6409741 2022-03-13 03:25:18.999000 N 28.6156
Now if I convert the Datetime to Series and try to save it as csv with dateformate as "%s" and then read it I get the -1 instead of time in second. However if I try to save it in csv without specifying the format it reads it without any issue as mentioned above.
df_test.index.to_series().to_csv(path+'date_test.csv', date_format='%s')
pd.read_csv(path+'date_test.csv')
Datetime Datetime.1
0 1647135617 1647135617
1 1647135919 1647135919
2 1647136218 1647136218
3 1647136518 1647136518
4 -1 -1
5 -1 -1
6 -1 -1
7 -1 -1
8 -1 -1
9 -1 -1
Again I have this issue with Databrick Runtime: "10.4 LTS ML(Scala 2.12, Spark 3.2.1)" and tried Pandas version "1.4" and "1.2.4". My python version is 3.8.1. I also tried to run the same code on Jupyter with Pandas version = "1.2.4" and python 3.8.8 and I get the following error when I tried to save the file with "%s" format.
df_test.index.to_series().to_csv('date_test_save.csv', date_format ='%s')
ValueError Traceback (most recent call last)
<ipython-input-22-a07366f02b44> in <module>
----> 1 df_test.index.to_series().to_csv('date_test_save.csv', date_format ='%s')
~\Anaconda3\lib\site-packages\pandas\core\generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)
3385 )
3386
-> 3387 return DataFrameRenderer(formatter).to_csv(
3388 path_or_buf,
3389 line_terminator=line_terminator,
~\Anaconda3\lib\site-packages\pandas\io\formats\format.py in to_csv(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)
1081 formatter=self.fmt,
1082 )
-> 1083 csv_formatter.save()
1084
1085 if created_buffer:
~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in save(self)
246 )
247
--> 248 self._save()
249
250 def _save(self) -> None:
~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in _save(self)
251 if self._need_to_save_header:
252 self._save_header()
--> 253 self._save_body()
254
255 def _save_header(self) -> None:
~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in _save_body(self)
282
283 def _save_body(self) -> None:
--> 284 nrows = len(self.data_index)
285 chunks = int(nrows / self.chunksize) + 1
286 for i in range(chunks):
~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in data_index(self)
176 ):
177 data_index = Index(
--> 178 [x.strftime(self.date_format) if notna(x) else "" for x in data_index]
179 )
180 return data_index
~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in <listcomp>(.0)
176 ):
177 data_index = Index(
--> 178 [x.strftime(self.date_format) if notna(x) else "" for x in data_index]
179 )
180 return data_index
pandas\_libs\tslibs\timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.strftime()
ValueError: Invalid format string