
I have some code I wrote using Pandas which does the exact processing I want, but unfortunately is slow. In an effort to speed up processing times, I have gone down the path of converting the dataframe to a list of tuples, where each tuple is a row in the dataframe.

I have found that the datetime.datetime objects are converted to long ints, 1622623719000000000 for example.

I need to calculate the time difference between each row, so my thought was 'ok, I'm not great at python/pandas, but I know I can do datetime.fromtimestamp(1622623719000000000) to get a datetime object back.'

Unfortunately, datetime.fromtimestamp(1622623719000000000) throws OSError: [Errno 22] Invalid argument.

So, off to Google/SO to find a solution. I find an example that shows dividing the long int by 1e3. I try that, but still get 'invalid argument.'

I play around with the division of the long int, and dividing by 1e9 gets me the closest to the original datetime.datetime value, but not quite.

How do I successfully convert the long int back to the correct datetime value?

Code to convert string format to datetime:

df.start_time = pd.to_datetime(df.report_date + " " + df.start_time)

Info on dataframe:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 19 columns):
report_date      46 non-null object
...
...
...
start_time       46 non-null datetime64[ns]
...
...
...

dtypes: datetime64[ns](1), float64(7), int64(1), object(10)
memory usage: 6.9+ KB
None

My test code:

print("DF start time", df.start_time[5], "is type", type(df.start_time[5]))
print("list start time", tup_list[5][7], "is type", type(tup_list[5][7]),"\n")

print("Convert long int in row tuple to datetime")
print(datetime.fromtimestamp(int(1622623719000000000/1e9)))

Output:

DF start time 2021-06-02 08:16:33 is type <class 'pandas._libs.tslibs.timestamps.Timestamp'>
list start time 1622623719000000000 is type <class 'int'> 

Convert long int in row tuple to datetime
2021-06-02 03:48:39

2 Answers


Change the dtype of your start_time column to convert the Timestamps to integers (nanoseconds since the epoch):

df = pd.DataFrame({'start_time': ['2021-06-02 08:16:33']}) \
       .astype({'start_time': 'datetime64[ns]'})

>>> df
           start_time
0 2021-06-02 08:16:33

>>> df['start_time'].astype(int)
0    1622621793000000000  # NOT 1622623719000000000
Name: start_time, dtype: int64

>>> pd.to_datetime(1622621793000000000)  # Right
Timestamp('2021-06-02 08:16:33')

>>> pd.to_datetime(1622623719000000000)  # Wrong
Timestamp('2021-06-02 08:48:39')
– Corralien
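
For what it's worth, here is a minimal round-trip sketch along the same lines, assuming the start_time column is datetime64[ns]. It also hints at why dividing the raw integer by 1e9 and passing it to datetime.fromtimestamp() can come back shifted: without an explicit timezone, fromtimestamp() converts to local time, while the Timestamp in the dataframe is naive.

>>> ns = df['start_time'].astype('int64')   # nanoseconds since the epoch
>>> pd.to_datetime(ns)                      # int64 nanoseconds -> datetime64[ns]
0   2021-06-02 08:16:33
Name: start_time, dtype: datetime64[ns]

>>> from datetime import datetime, timezone
>>> datetime.fromtimestamp(ns[0] / 1e9, tz=timezone.utc)   # seconds, pinned to UTC
datetime.datetime(2021, 6, 2, 8, 16, 33, tzinfo=datetime.timezone.utc)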
  • or maybe your indexing is not good. What are the values before and after `Timestamp('2021-06-02 08:48:39')` in your dataframe? – Corralien Jul 01 '21 at 20:50
  • I copied/pasted the above code to convert the column to `astype(int)` and got `TypeError: cannot astype a datetimelike from [datetime64[ns]] to [int32]`. My original code to convert from string format to datetime - `df.start_time = pd.to_datetime(df.start_time)` – luser3228561 Jul 06 '21 at 14:03
  • Please update your code with a sample of your original dataframe and the output of `df.info()`. – Corralien Jul 06 '21 at 15:35
  • Added as requested to original post. – luser3228561 Jul 06 '21 at 19:25
  • What is your version of Python? Are you working on a 32-bit machine (int32)? If you want to solve the problem, I think you should use `df['start_time'].astype(np.int64)`. I can reproduce your error with: `df['start_time'].astype(np.int32)` – Corralien Jul 06 '21 at 19:40
  • Python version 3.7, 64-bit. I used `df['start_time'].astype(np.int64)` to create the `numpy.int64`, and the conversion back to a Timestamp works when I convert `df.start_time` back to `Timestamp`. But, to process my logic faster, I convert the df to a list of tuples via `df.to_records(index=False).tolist()`. The issue appears to be that when the df rows are converted to the list of tuples, `df.start_time` is converted from `Timestamp` to `int`, `1622621793000000000` in the df (correct) to `1622623719000000000` in the tuple (incorrect). – luser3228561 Jul 07 '21 at 13:15
  • Is my method not fast enough? It seems you make many operations (browsing and indexing a list), which can be slow. – Corralien Jul 07 '21 at 13:53
  • Your process is fine, it's my processing of other data in the df that takes time. I'm just learning Pandas, and I know the worst thing you can do is iterate through a df's rows, which I'm currently doing. I put the data into a df to get it into a workable set, and now convert it back to a list of tuples to process faster by iterating through a list. – luser3228561 Jul 07 '21 at 18:34
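
For reference, a rough sketch of the vectorized route the last two comments point toward, assuming start_time is already datetime64[ns] and the rows are in the order to be compared; Series.diff() returns the gap between consecutive rows as Timedelta values:

import pandas as pd

# Toy two-row frame standing in for the real dataframe (values taken from the question)
df = pd.DataFrame({'start_time': pd.to_datetime(['2021-06-02 08:16:33',
                                                 '2021-06-02 08:48:39'])})

df['time_diff'] = df['start_time'].diff()                # Timedelta between consecutive rows
df['diff_seconds'] = df['time_diff'].dt.total_seconds()  # float seconds; NaN for the first row
print(df)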

I solved this problem, and probably should have gone with this approach in the first place.

I originally converted my dataframe to a list of tuples for faster row processing:

df.to_records(index=False).tolist()

Unfortunately, in the conversion the values in df.start_time were converted from a <class 'pandas._libs.tslibs.timestamps.Timestamp'> to a <class 'int'>.
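
(As an aside, even when staying with to_records(), the nanosecond integers can be turned back into Timestamps, since pd.Timestamp treats a bare integer as nanoseconds since the epoch. A quick sketch:)

import pandas as pd

ns_value = 1622621793000000000     # the integer that ends up in the tuple
ts = pd.Timestamp(ns_value)        # bare int is interpreted as nanoseconds
print(ts)                          # 2021-06-02 08:16:33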

The solution:

df.to_dict('records')

This preserved the data in df.start_time as Timestamp types:

'start_time': Timestamp('2021-06-02 08:16:33')
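
A short sketch of how the record dicts can then be walked to get the time between consecutive rows; subtracting two Timestamps yields a Timedelta (only the start_time key is assumed here):

rows = df.to_dict('records')                        # df as in the question; Timestamps preserved
for prev, curr in zip(rows, rows[1:]):
    gap = curr['start_time'] - prev['start_time']   # Timestamp - Timestamp -> Timedelta
    print(gap, "=", gap.total_seconds(), "seconds")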

Lesson learned.