Suppose we have a dataset with a UNIX timestamp in milliseconds:
data = [
{
"unix_ts": 1669291200000,
"val": 10
},
{
"unix_ts": 1669291260000,
"val": 25
}
]
Which we convert to a Pandas dataframe with a Pandas timestamp (datetime) set to US/Eastern:
df = pd.DataFrame(data)
df['ET'] = pd.to_datetime(df['unix_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
unix_ts val ET
0 1669291200000 10 2022-11-24 07:00:00-05:00
1 1669291260000 25 2022-11-24 07:01:00-05:00
We can see that the ET time is 5 hours behind the UTC unix_ts
Suppose we want a new integer column with a value that corresponds with that -5 hours difference. Naively, we could do this:
df['adjusted_ts'] = df['unix_ts'] - (3600000 * 5)
# Include column to allow us to check the result by eye.
df['Check_ET'] = pd.to_datetime(df['adjusted_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Result:
unix_ts val ET adjusted_ts Check_ET
0 1669291200000 10 2022-11-24 07:00:00-05:00 1669273200000 2022-11-24 02:00:00-05:00
1 1669291260000 25 2022-11-24 07:01:00-05:00 1669273260000 2022-11-24 02:01:00-05:00
We can see from the Check_ET column that this "works", but it won't when we get to the part of the year when US/Eastern is only 4 hours behind UTC.
It would be handy to be able to do something like this:
import numpy as np
df['smart_adjusted_ts'] = (df['ET'].astype(np.int64) // 10 ** 9) * 1000
But, sadly, that's not so 'smart', as it results in a column that's identical to the original unix_ts (Regardless of the ET column's timezone, the underlying data (NumPy) is always stored as nanoseconds since the EPOCH in UTC.):
unix_ts ... Check_ET smart_adjusted_ts
0 1669291200000 ... 2022-11-24 02:00:00-05:00 1669291200000
1 1669291260000 ... 2022-11-24 02:01:00-05:00 1669291260000
So, unless there is a special method to do this (Anyone?), my only thought is to go back to the original approach but dynamically extract the UTC offset (ideally as ints: 4 or 5) from the ET column.
The problem is, I can't find how to do that either, but I'm hoping to achieve something like this:
df['adjusted_ts'] = df['unix_ts'] - (3600000 * et_utc_abs_diff)
Please be aware that a dataset could include dates with both (4 & 5 hour) differences, so it's important to get this difference on a row-by-row basis as opposed to having a master variable set to 4 or 5.
Any ideas for an elegant solution, please?
Edit
I came up with the following, and it gets the right result, but I suspect there must be a better way using standard Pandas methods.
df['adjusted_ts'] = df['unix_ts'] - (df['ET'].astype(str).str.slice(start=21, stop=22).astype(int) * 3600000)