
I have a dataframe 'pages' with a timestamp field 'dimension3'. dimension3 is supposed to be an ISO 8601 timestamp, but some rows have nonsensical timezone offsets that are causing problems.

The dataframe holds pageviews of a website from visitors all over the world, so each row has its own timestamp and timezone offset.

The timezone offset should be within the range -12 to +14.

Most of my timestamps fit within this range. Here is an example of a sound data point:

x = dateutil.parser.parse('2019-11-11T07:08:09.640-4:00')
x
datetime.datetime(2019, 11, 11, 7, 8, 9, 640000, tzinfo=tzoffset(None, -14400))

Here is an example of a problematic data point that appears in my data frame:

y = dateutil.parser.parse('2019-11-11T07:08:09.640-31:00')
y
datetime.datetime(2019, 11, 11, 7, 8, 9, 640000, tzinfo=tzoffset(None, -111600))

The problematic one has an offset of -31, which falls below the minimum bound of -12.

This is a problem because when I try to send these data to a Postgres database with a field of type timestamptz, the upload fails with an error saying some data points are outside the acceptable bounds.

I've spent a bit of time going through the responses on this post and this one to see if there's an out-of-the-box solution for these out-of-bounds timezone offsets.

None of the ISO-related functions in the various packages helped me. I'm thinking I might have to apply a lambda function to each timestamp with some if/else logic: read the offset, and if it is below -12, set it to -12; if it is greater than +14, set it to +14.

How can I do that assuming that my dataframe is called pages and my timestamp column is called dimension3?
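
For illustration, something along these lines is what I have in mind, though it's an untested sketch: clamp_offset and the bound names are just placeholders, and it assumes dimension3 already holds parsed, timezone-aware datetime objects rather than raw strings.

from datetime import timedelta, timezone

MIN_OFFSET = timedelta(hours=-12)
MAX_OFFSET = timedelta(hours=14)

def clamp_offset(dt):
    # naive timestamps have nothing to clamp
    if dt.tzinfo is None:
        return dt
    # read the offset from the tzinfo object itself; dt.utcoffset() can
    # raise ValueError when the offset is beyond +/- 24 hours
    offset = dt.tzinfo.utcoffset(dt)
    if offset < MIN_OFFSET:
        return dt.replace(tzinfo=timezone(MIN_OFFSET))
    if offset > MAX_OFFSET:
        return dt.replace(tzinfo=timezone(MAX_OFFSET))
    return dt

pages['dimension3'] = pages['dimension3'].apply(clamp_offset)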

Other, more elegant solutions also welcome.

  • Wouldn't it make more sense to fix the problem "upstream", i.e. prevent such nonsensical timestamps from being generated in the first place? A timezone is not just an offset from UTC: the offset itself can change over time, since a country can decide to change it. – Willem Van Onsem Dec 15 '19 at 17:33
  • We cannot replace data from the API retroactively, and it's beyond the scope of my immediate task of getting these data into Postgres. It's also a tiny minority of data points – Doug Fir Dec 15 '19 at 17:36
  • The absolute fallback option is converting the field to a string and then uploading to Postgres, but I really want to avoid that – Doug Fir Dec 15 '19 at 17:37
  • I found this routine that converts to the local timezone; if we can figure out how to pull the timezone, it might work for what you're doing: def utc_to_local(utc_dt): return utc_dt.replace(tzinfo=timezone.utc).astimezone(tz=None) I'm still looking into this, I'm interested in the answer as well – oppressionslayer Dec 15 '19 at 17:55
  • Thanks for your help. I'm trying to find a way to identify problem timestamps to apply this function to – Doug Fir Dec 15 '19 at 17:58
  • @DavisHerring the majority of timestamps are sound and contain the timezone offset against UTC. So they mean that the pageview of our website took place at whatever time in whatever timezone offset. For the small minority of timestamps with offsets outside the accepted range of -12 to +14, I'd like to clamp them to the min or max, -12 or +14, depending on whether the offset is negative or positive – Doug Fir Dec 15 '19 at 19:08
  • @DougFir: You actually said that, sorry. Show the code you’ve tried for the adjusting function. – Davis Herring Dec 15 '19 at 19:09
  • @DavisHerring this is what I'm using right now. I'm just dropping the offset altogether because I do not know how to determine if the offset is out of bounds: ```pages['dimension3'] = pages['dimension3'].apply(lambda x: x.replace(tzinfo = None))``` – Doug Fir Dec 15 '19 at 19:11
  • In an ideal world I would only use that lambda function in conjunction with an if clause, where if the offset is below -12 then set it to exactly -12, and if above +14 then set it to exactly +14. I do not know how to check and reset offsets in this way – Doug Fir Dec 15 '19 at 19:13
  • I think this solution is quite good for your problem (just manually convert it to GMT): https://stackoverflow.com/a/38992733/11610186 – Grzegorz Skibinski Dec 15 '19 at 19:30
  • @DougFir: Can’t you just consult [`utcoffset`](https://docs.python.org/3/library/datetime.html#datetime.tzinfo.utcoffset) in the lambda and compute a new [`timezone`](https://docs.python.org/3/library/datetime.html#datetime.timezone) object based on it (when not just reusing the input because it’s valid)? – Davis Herring Dec 15 '19 at 19:37

1 Answer


I used a custom function with a try/except, applied via a lambda:

import dateutil.parser

def rogue_tz_offsets(t):
    """
    Parse the timestamp string and try to serialise it back to ISO format;
    if that fails because of an out-of-range offset, drop the timezone info.
    """
    t = dateutil.parser.parse(t)
    try:
        return t.isoformat()
    except ValueError:
        # isoformat() rejects offsets of 24 hours or more, so fall back
        # to a naive timestamp with the offset stripped
        t = t.replace(tzinfo=None)
        return t.isoformat()

And then

pages['dimension3'] = pages['dimension3'].apply(lambda x: rogue_tz_offsets(x))
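
For reference, with the example values from the question this behaves roughly as follows (at least with the dateutil version shown above; newer dateutil releases may reject the -31:00 offset already at parse time):

rogue_tz_offsets('2019-11-11T07:08:09.640-4:00')
# '2019-11-11T07:08:09.640000-04:00'  (offset in range, kept as-is)

rogue_tz_offsets('2019-11-11T07:08:09.640-31:00')
# '2019-11-11T07:08:09.640000'  (isoformat() raised ValueError, offset dropped)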