0

This is a followup to my earlier question Pymongo aggregation - passing python list for aggregation

I have some statistical data stored in half-hourly buckets. Each of the half-hourly buckets has an associated time-stamp in GMT.

My goal is to aggregate the data present in half-hourly buckets into daily buckets based on any timezone.

So far here is what I'm trying to do:

  1. Converting query start and end dates into GMT-equivalent time.

  2. Using the start and end-date in the $match stage, to filter documents that are relevant to me.

  3. Add timezone offset to individual timestamps, then group them based on date.

I'm able to accomplish the above, but the problem is with the offsets and grouping them into the right bucket based on timezone.

For instance, when I try to do it, it aggregates and groups the relevant documents - but it does so using GMT time.

How can I aggregate the data based on a timezone that I would pass, and how to assign relevant date-ranges for the same?

When I try to sum up for the entire query range, "overall" data appears to add up correctly, however day-wise does not fall into the right timezone-adjusted date-buckets, all the buckets are still in GMT.

Here's my code

def getTimezoneOffset(app_tz):
    apptz_now = datetime.datetime.now(pytz.timezone(app_tz))
    milliseconds_offset = apptz_now.utcoffset().total_seconds()*1000
    return milliseconds_offset

c_start_date = translate_to_gmt(start_date, app_tz)
c_end_date = translate_to_gmt(end_date, app_tz)

client = MongoClient('mydbname.com')
db = client[db_name]
halfhourly_collection = db['HalfHourlyBuckets']
reference_collection = db['StructforBuckets']


project = { "$project": {} }
group = { "$group": {} }
keys_list = []

document_cursor = reference_collection.find({"_id": ObjectId((cid)) })
for document in document_cursor:
    temp_keylist = keys_to_aggregate(document, '', [])
    for element in temp_keylist:
        if element != "_id":
            keys_list.append(element)
pprint.pprint(keys_list)
# match = { "$match": { "cid": ObjectId(cid) } }
match = { "$match": { "cid": ObjectId(str(cid)), "t":{'$gte':c_start_date, '$lte':c_end_date} } }

project = { "$project": {} }
group = { "$group": {} }
timezone_offset = getTimezoneOffset(app_tz)
project["$project"]["day"] = {"$add": ["$t", timezone_offset]}
group["$group"]["_id"] = { "$substr": ["$day",  0, 10] }

project["$project"]["cid"] = "$cid"

for item in keys_list:

    # project["$project"]["day"] = { "$substr": ["$t",  0, 10] }
    # group["$group"]["_id"] = "$day"
    sum = {"$sum": ""}
    sum["$sum"] = "$" + item.replace(".", "\u002e")
    project["$project"][item.replace(".", "\u002e")] = "$" + item
    group["$group"][item.replace(".", "\u002e")] = sum
#\u002e is just a delimiter that I'm using for nesting/un-nesting it later
pipeline = [match, project, group]
# pprint.pprint(pipeline)
output = halfhourly_collection.aggregate(pipeline = pipeline)
pprint.pprint(unflatten(list(output['result']), cid))
client.close()
elevenmac
  • 23
  • 4

0 Answers0