0

I was fearing this step for a couple of months now, but it is inevitable that I make sense of the time-mess we're currently in. I know that there's a lot of information available, but I still have problems applying the concept of local time to my problem:

Basically, we have a Postgres Database where we store different Social Media Posts (Foursquare, Flickr etc.) for analysis. The API's usually return Unix timestamps, which are always UTC as far as I know.

The problem is: When we analyze this data, we don't care about UTC, we need to know the local time of the user when he/she posted on Social media. Assume, for example, someone took a picture of the Opera House in Sydney at 4PM local Sydney time (GMT+10). How do we store this in Postgres - timestamp with timezone or timestamp without timezone? Is the local time of the user available at all - or do we have to take the Geocoordinates into account (if available) to calculate local time of the user?

Summarized:

  • we don't care about the time the analyst/investigator is in when viewing the data in Postgres
  • we don't care about the offset to UTC/GMT, it would be nice to have it (to know the general location), but it is not really important
  • we only care about the subjective, personal time the user perceived when posting/taking pictures

What would be the best approach to store metadata given these circumstances?

[Edit] I made a query to see where users mention "o'clock" in their posts, to compare this to the post_publish_date (timestamp without timezone), as returned from the API). Surprisingly, What I see suggests that all these timestamps are local time, not UTC:

"post_publish_date","post_body"
"2016-12-06 07:27:07","[...] at 8 o'clock a.m. [...]"
"2018-02-22 05:21:53","[...] main 6 o'clock road to [...]"
"2018-01-27 06:13:04","[...] get up early otherwise you miss [...] 6 o'clock [...]"
"2018-02-09 16:21:37","It's Friday [...] its gotta be 5 o'clock [...]"
"2018-02-02 15:44:21","It's Friday, [...] it's always 5 o'clock [...]"
"2015-11-21 02:37:53","[...] until 4 o'clock in the morning. [...]"
"2017-09-15 07:51:53","[...] 9 o'clock at night[...]"
"2017-12-18 19:52:40","[...]Date: ♨18.12.2017  [...] 20o'clock [...] Location: New York[...]"

This is a great post that explains the differences, but it is said that a Unix timestamp can be in any timezone - UTC or local time and one never knows unless the timezone is given. Now the question is: can I store posts with timestamps without timezone in the same column (formatted as timestamp with timezone) as those posts that have a timezone specified?

Here is another post that described how to retrieve ocal time for tweets from the user's respective coordinates when tweeting. Therefore, the time that at leats Twitter returns is UTC time, not local time.

Alex
  • 2,784
  • 2
  • 32
  • 46

1 Answers1

0

After some testing, I think I got the correct answer:

Different social media services handle timestamps and timezones differently. Three aspects of time are relevant:

  • (1) the local time of the user (when the post was created or published)
  • (2) the absolute UTC/GMT time (the number of seconds passed since Jan 1, 1970 in UTC, which is the same as GMT), which is relevant for example for global comparisons of times
  • (3) the local time of viewer (e.g. when viewing a tweet online, this prevents the paradox of seeing something that appears published in the future)

Example: (1) A Tweet was published at 9:37 in Berlin (+2 CEST), (2) Twitter stores this as 7:37 UTC (+00), and (3) when this tweet is viewed in California (-7 PT), the time that the user sees is 12:37 AM (translated to local time of the viewer).

While (3) is relevant to Web-Apps, in data analysis, one is usually interested in the local time of the contributing user, not the UTC timestamp or the local time of the viewer. For Flickr and Instagram, for example, this time is directly available from the API. For Twitter, the local time must be calculated, based on additional criteria that are sometimes available.

What does this mean for storing data in Postgres?

In postgres, there are two options: saving times as timestamp without timezone or timestamp with timezone. However, Postgres will never store timezone information in the timestamp, the 'timezone' here refers only to how the timestamp is formatted when displaying Postgres timestamps (3). Therefore, when storing data for analysis, timestamps should not be processed. They're dealt with as a timestamp without timezone because the timezone of the contributing user is initially unknown (without taking into account additional information). For some services, such as Twitter, it is the task of the analyst to translate this time to the local time of the user before analysis (e.g. by taking into account utc_offset, location of post, language or other attributes).

Alex
  • 2,784
  • 2
  • 32
  • 46