
I have the following localized datetime series. When I try to save the date/times as seconds since the epoch, some of them get saved as -1 instead of the actual value. Interestingly, this happens for the timestamps after the daylight-saving transition. Does anyone know how to fix this? Here is the code.

test_time_df

Here is the output of test_time_df:

Out[4]: Datetime
2022-03-13 01:40:17.999000-06:00   2022-03-13 01:40:17.999000-06:00
2022-03-13 01:45:19-06:00                 2022-03-13 01:45:19-06:00
2022-03-13 01:50:18-06:00                 2022-03-13 01:50:18-06:00
2022-03-13 01:55:18.999000-06:00   2022-03-13 01:55:18.999000-06:00
2022-03-13 03:00:28-05:00                 2022-03-13 03:00:28-05:00
2022-03-13 03:05:20-05:00                 2022-03-13 03:05:20-05:00
2022-03-13 03:10:17.999000-05:00   2022-03-13 03:10:17.999000-05:00
2022-03-13 03:15:18-05:00                 2022-03-13 03:15:18-05:00
2022-03-13 03:20:20-05:00                 2022-03-13 03:20:20-05:00
2022-03-13 03:25:18.999000-05:00   2022-03-13 03:25:18.999000-05:00
Name: Datetime, dtype: datetime64[ns, America/Chicago]

Here is how I save it to CSV:

test_time_df.to_csv('test_01.csv', date_format='%s')

When I read the data back, with pandas or directly in Linux, I get the following results:

Datetime,Datetime
1647135617,1647135617
1647135919,1647135919
1647136218,1647136218
1647136518,1647136518
-1,-1
-1,-1
-1,-1
-1,-1
-1,-1
-1,-1
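
Note that the rows that come back as -1 are exactly the ones with the -05:00 (CDT) offset, i.e. the timestamps after the daylight-saving transition. "%s" is not a documented Python/pandas directive; it is passed through to the C library's strftime(), where glibc computes it via mktime(tm) (see strftime(3)), and mktime() returns -1 when it cannot reconcile the broken-down time, e.g. tm_isdst=1 on a machine whose local timezone has no DST. A minimal sketch of that hypothesis, assuming the Databricks nodes run with TZ=UTC (which I have not confirmed):

import os, time

# Reproduce the -1 on a glibc/Linux system: build a struct tm that claims to
# be in DST (tm_isdst=1) while the process timezone (UTC) has no DST, then
# format it with the glibc-only '%s' directive. mktime() fails and returns -1.
os.environ['TZ'] = 'UTC'  # assumption: cluster nodes run in UTC
time.tzset()              # POSIX-only; not available on Windows
tm = (2022, 3, 13, 3, 0, 28, 6, 72, 1)  # 2022-03-13 03:00:28, tm_isdst=1
print(time.strftime('%s', tm))          # prints '-1' under glibc

Note also that even the non-negative values look like the wall time reinterpreted as UTC (2022-03-13 01:40:17 UTC is 1647135617), not the true instant (07:40:17 UTC would be 1647157217): C strftime() never sees the row's -06:00 offset, only tm_isdst.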

By the way, I am using Databricks with pandas == 1.4 and numpy == 1.20.1.

Here is some additional information: the original dataframe is df_test, constructed with the following input:

import datetime
import pandas as pd
df_test = pd.DataFrame({"Batch ID":[6409713, 6409715, 6409719, 6409721, 6409723, 6409731, 6409733,
       6409735, 6409739, 6409741],"      SCED Timestamp      ":[datetime.datetime(2022, 3, 13, 1, 40, 17, 999000),
       datetime.datetime(2022, 3, 13, 1, 45, 19),
       datetime.datetime(2022, 3, 13, 1, 50, 18),
       datetime.datetime(2022, 3, 13, 1, 55, 18, 999000),
       datetime.datetime(2022, 3, 13, 3, 0, 28),
       datetime.datetime(2022, 3, 13, 3, 5, 20),
       datetime.datetime(2022, 3, 13, 3, 10, 17, 999000),
       datetime.datetime(2022, 3, 13, 3, 15, 18),
       datetime.datetime(2022, 3, 13, 3, 20, 20),
       datetime.datetime(2022, 3, 13, 3, 25, 18, 999000)],
       "Repeated Hour Flag": ['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N'],
              "System Lamda": [27.3663, 26.2636, 28.441, 27.1946, 19.4144, 26.5737, 30.4865,
       29.5245, 29.6784, 28.6156]         })

df_test
Batch ID    SCED Timestamp  Repeated Hour Flag  System Lamda
20695   6409713 2022-03-13 01:40:17.999000  N   27.3663
20696   6409715 2022-03-13 01:45:19 N   26.2636
20697   6409719 2022-03-13 01:50:18 N   28.441
20698   6409721 2022-03-13 01:55:18.999000  N   27.1946
20699   6409723 2022-03-13 03:00:28 N   19.4144
20700   6409731 2022-03-13 03:05:20 N   26.5737
20701   6409733 2022-03-13 03:10:17.999000  N   30.4865
20702   6409735 2022-03-13 03:15:18 N   29.5245
20703   6409739 2022-03-13 03:20:20 N   29.6784
20704   6409741 2022-03-13 03:25:18.999000  N   28.6156

Then I perform the following transformation:

# Parse the naive timestamps and localize them to America/Chicago. The
# 'Repeated Hour Flag' drives tz_localize's `ambiguous` argument: True picks
# the DST (first) occurrence of a repeated wall time. On this spring-forward
# date no wall time is ambiguous, so the flag has no effect here.
datetimes = pd.to_datetime(df_test['      SCED Timestamp      '])
datetimes = pd.DatetimeIndex(datetimes)
datetimes = datetimes.tz_localize('America/Chicago', ambiguous=(df_test['Repeated Hour Flag'] == 'N'))
datetimes.name = "Datetime"
df_test = df_test.set_index([datetimes])
df_test
Batch ID    SCED Timestamp  Repeated Hour Flag  System Lamda
Datetime                
2022-03-13 01:40:17.999000-06:00    6409713 2022-03-13 01:40:17.999000  N   27.3663
2022-03-13 01:45:19-06:00   6409715 2022-03-13 01:45:19 N   26.2636
2022-03-13 01:50:18-06:00   6409719 2022-03-13 01:50:18 N   28.441
2022-03-13 01:55:18.999000-06:00    6409721 2022-03-13 01:55:18.999000  N   27.1946
2022-03-13 03:00:28-05:00   6409723 2022-03-13 03:00:28 N   19.4144
2022-03-13 03:05:20-05:00   6409731 2022-03-13 03:05:20 N   26.5737
2022-03-13 03:10:17.999000-05:00    6409733 2022-03-13 03:10:17.999000  N   30.4865
2022-03-13 03:15:18-05:00   6409735 2022-03-13 03:15:18 N   29.5245
2022-03-13 03:20:20-05:00   6409739 2022-03-13 03:20:20 N   29.6784
2022-03-13 03:25:18.999000-05:00    6409741 2022-03-13 03:25:18.999000  N   28.6156
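
For context, America/Chicago sprang forward on 2022-03-13: wall times 02:00-02:59 do not exist that day, and the UTC offset flips from -06:00 (CST) to -05:00 (CDT), which is why the offsets switch partway down the index above. A quick check (a sketch; these two wall times are neither ambiguous nor nonexistent, so no extra arguments are needed):

import pandas as pd

# The offset flips across the 02:00 spring-forward gap on 2022-03-13.
idx = pd.DatetimeIndex(["2022-03-13 01:55:18", "2022-03-13 03:00:28"])
print(idx.tz_localize("America/Chicago"))
# DatetimeIndex(['2022-03-13 01:55:18-06:00', '2022-03-13 03:00:28-05:00'],
#               dtype='datetime64[ns, America/Chicago]', freq=None)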

Now, if I convert the Datetime index to a Series and save it to CSV with date_format='%s', then read it back, I get -1 instead of the time in seconds. However, if I save it to CSV without specifying the format, it reads back without any issue.

df_test.index.to_series().to_csv(path+'date_test.csv', date_format='%s')
pd.read_csv(path+'date_test.csv')
Datetime    Datetime.1
0   1647135617  1647135617
1   1647135919  1647135919
2   1647136218  1647136218
3   1647136518  1647136518
4   -1  -1
5   -1  -1
6   -1  -1
7   -1  -1
8   -1  -1
9   -1  -1
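
A workaround that avoids strftime() entirely is to compute the epoch seconds in pandas before writing. A minimal sketch, using the documented timestamp-to-epoch subtraction pattern (the tz-aware origin makes the result unambiguous UTC seconds):

import pandas as pd

# Integer seconds since the Unix epoch, computed without the platform's
# strftime(): subtract a tz-aware epoch origin, then floor-divide by 1 second.
s = df_test.index.to_series()
epoch_s = (s - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
epoch_s.to_csv(path + 'date_test.csv')

Reading this file back gives plain integers on any platform, regardless of daylight-saving status.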

Again, I have this issue with Databricks Runtime "10.4 LTS ML (Scala 2.12, Spark 3.2.1)"; I tried pandas versions "1.4" and "1.2.4". My Python version is 3.8.1. I also tried running the same code in Jupyter with pandas "1.2.4" and Python 3.8.8, and there I get the following error when saving the file with the "%s" format.

df_test.index.to_series().to_csv('date_test_save.csv', date_format ='%s')
ValueError                                Traceback (most recent call last)
<ipython-input-22-a07366f02b44> in <module>
----> 1 df_test.index.to_series().to_csv('date_test_save.csv', date_format ='%s')

~\Anaconda3\lib\site-packages\pandas\core\generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)
   3385         )
   3386 
-> 3387         return DataFrameRenderer(formatter).to_csv(
   3388             path_or_buf,
   3389             line_terminator=line_terminator,

~\Anaconda3\lib\site-packages\pandas\io\formats\format.py in to_csv(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)
   1081             formatter=self.fmt,
   1082         )
-> 1083         csv_formatter.save()
   1084 
   1085         if created_buffer:

~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in save(self)
    246             )
    247 
--> 248             self._save()
    249 
    250     def _save(self) -> None:

~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in _save(self)
    251         if self._need_to_save_header:
    252             self._save_header()
--> 253         self._save_body()
    254 
    255     def _save_header(self) -> None:

~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in _save_body(self)
    282 
    283     def _save_body(self) -> None:
--> 284         nrows = len(self.data_index)
    285         chunks = int(nrows / self.chunksize) + 1
    286         for i in range(chunks):

~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in data_index(self)
    176         ):
    177             data_index = Index(
--> 178                 [x.strftime(self.date_format) if notna(x) else "" for x in data_index]
    179             )
    180         return data_index

~\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in <listcomp>(.0)
    176         ):
    177             data_index = Index(
--> 178                 [x.strftime(self.date_format) if notna(x) else "" for x in data_index]
    179             )
    180         return data_index

pandas\_libs\tslibs\timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.strftime()

ValueError: Invalid format string
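
The Jupyter failure fits the same picture: since "%s" is not a documented Python directive, it is handed to the C runtime, and on Windows that runtime does not support it, so Python raises ValueError: Invalid format string instead of returning -1 the way glibc does. Per-timestamp alternatives that do not depend on the C library (a sketch):

import pandas as pd

# Portable conversions on a single tz-aware Timestamp.
ts = pd.Timestamp("2022-03-13 03:00:28", tz="America/Chicago")
print(int(ts.timestamp()))  # 1647158428 -- true UTC epoch seconds
print(ts.isoformat())       # 2022-03-13T03:00:28-05:00
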
  • I just saw this post "https://stackoverflow.com/questions/13999850/how-to-specify-date-format-when-using-pandas-to-csv" which says that because "%s" is not documented in python/pandas, it may not be stable. I may just stick with another format (e.g. parquet) or save the datetime as a string. – Shahin Shirazi Aug 18 '22 at 14:03
  • I couldn't reproduce this on macOS with `pandas == 1.4` and `numpy==1.20.1`. – aaron Aug 27 '22 at 07:36
  • Yes, I can't reproduce it unless I am running Databricks; this is specific to Databricks. – Shahin Shirazi Aug 27 '22 at 21:30
