7

I have a large data loaded from a pickled file. The data is a sorted list of tuples containing a datetime and an int like this

[ (datetime.datetime(2010, 2, 26, 12, 8, 17), 5594813L), 
  (datetime.datetime(2010, 2, 26, 12, 7, 31), 5594810L), 
  (datetime.datetime(2010, 2, 26, 12, 6, 4) , 5594807L),
  etc
]

I want to get a population density based on some time intervals. For example, I want to grab the number of records within 5 minute / 1 minute / 30 second periods.

What is the best method to do this? I know I can just loop through every instance in the list but was looking for a better approach (if one exists).

Desired output would be something like:

2010-01-01 04:10:00  --- 5000
2010-02-04 10:05:00  --- 4000
2010-01-02 13:25:00  --- 3999
sberry
  • 128,281
  • 18
  • 138
  • 165

2 Answers2

6

Check out itertools.groupby. You can pass a function that calculates the proper bucket as the key. Then, you can run your aggregations (counts, averages, what-have-you) on the groups in the resulting iterable.

Hank Gay
  • 70,339
  • 36
  • 160
  • 222
  • +1. This looks like a very good soltution, and I may end up going this route, but @~unutbu's answer is solving my issue perfectly. – sberry Feb 27 '10 at 00:30
  • Notable that itertools.groupby only produces sequential groups, (or, as stated in the docs: to group across entire collection given and produce "true" groups such SQL GROUP BY, etc, you must provide input sorted with the same key as given to the groupby function). Basically, itertools groupby just breaks a group off sequentially when the key produced by given key function differs from the key produced for the previous item. – dpb Apr 20 '20 at 13:24
6

bisect.bisect is another way to solve this problem:

import datetime
import bisect
import collections

data=[ (datetime.datetime(2010, 2, 26, 12, 8, 17), 5594813L), 
  (datetime.datetime(2010, 2, 26, 12, 7, 31), 5594810L), 
  (datetime.datetime(2010, 2, 26, 12, 6, 4) , 5594807L),
]
interval=datetime.timedelta(minutes=1,seconds=30)
start=datetime.datetime(2010, 2, 26, 12, 6, 4)
grid=[start+n*interval for n in range(10)]
bins=collections.defaultdict(list)
for date,num in data:
    idx=bisect.bisect(grid,date)
    bins[idx].append(num)
for idx,nums in bins.iteritems():
    print('{0} --- {1}'.format(grid[idx],len(nums)))
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677