
I'm investigating solutions for storing and querying a historical record of event occurrences for a large number of items.

Here is the simplified scenario: I get a daily log for 200,000 streetlamps (labeled sl1 to sl200000) which shows whether each lamp was operational on that day. It does not matter for how long the lamp was in service, only whether it was on a given calendar day.

Other bits of information are stored for each lamp as well and the beginning of the Python class looks something like this:

class Streetlamp(object):
    """Class for streetlamp record"""
    def __init__(self, **args):
        self.location = args['location']
        self.power = args['power']
        self.inservice = ???  # the daily operational record -- this is the open question

My py-foo is not too great and I would like to avoid a solution which is too greedy on disk/memory storage. A dict of (year, month, day) tuples per lamp could be one solution, but I'm hoping to get pointers for something more efficient.
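
For concreteness, the naive version I have in mind would look roughly like this (a sketch only; a set of datetime.date objects stands in for the (year, month, day) tuples, and the constructor arguments are made up):

import datetime

class Streetlamp(object):
    """Class for streetlamp record"""
    def __init__(self, **args):
        self.location = args['location']
        self.power = args['power']
        self.inservice = set()  # one datetime.date per operational day

lamp = Streetlamp(location='Main St', power=100)
lamp.inservice.add(datetime.date(2010, 1, 1))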

A record could be stored as a bit stream, with each bit representing a day of the year, starting with Jan 1. Hence, if a lamp was operational for the first three days of 2010, the record could be:

sl1000_up = {'2010': '11100000000000...', '2011': '11111100100...'}

Search across year boundaries would need a merge, leap years are a special case, plus I'd need to code/decode a fair bit with this home-grown solution. It doesn't seem quite right. speed-up-bitstring-bit-operations, how-do-i-find-missing-dates-in-a-list and finding-data-gaps-with-bit-masking were interesting postings I came across. I also investigated python-bitstring and did some googling, but nothing seems to really fit.

Additionally, I'd like searching for 'gaps' to be possible, e.g. 'three or more days out of action', and it is essential that a flagged day can be converted into a real calendar date.
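
To illustrate what I mean, a gap search over one year's bit string might look like this (a rough sketch only; day index 0 is taken to be Jan 1 of the given year):

import datetime

def gaps(bits, year, min_len=3):
    """Yield (start_date, end_date) for runs of >= min_len '0' days."""
    base = datetime.date(year, 1, 1)
    run_start = None
    for i, b in enumerate(bits + '1'):  # sentinel '1' closes a trailing run
        if b == '0':
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_len:
                yield (base + datetime.timedelta(days=run_start),
                       base + datetime.timedelta(days=i - 1))
            run_start = None

# e.g. list(gaps('1110001111', 2010)) -> [(2010-01-04, 2010-01-06)]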

I would appreciate ideas or pointers to possible solutions. To add further detail: the back-end DB used is ZODB, and pure Python objects which can be pickled are preferred.

– Axial

2 Answers


Create a 2D array in NumPy:

import numpy as np

nbLamps = 200000
nbDays = 365

# one row per lamp, one column per day of the year
arr = np.zeros((nbLamps, nbDays), dtype=bool)

It will be very memory-efficient and you can easily aggregate over days and lamps.
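
Continuing the snippet above, aggregations then become one-liners (a sketch; the day index and the 2010 base date are just for illustration):

import datetime

days_up_per_lamp = arr.sum(axis=1)   # in-service days for each lamp
lamps_up_per_day = arr.sum(axis=0)   # working lamps on each calendar day

day = 42                             # day index within the year
down_that_day = np.flatnonzero(~arr[:, day])
print(datetime.date(2010, 1, 1) + datetime.timedelta(days=day))  # 2010-02-12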

In order to manipulate the days even better, have a look at scikits.timeseries. It will let you access the dates as datetime objects.

– eumiro
  • Thanks for pointing out scikits.timeseries. It seems to support most of the analysis I have to do. Storing all lamps in one array for one year isn't quite feasible, since I'd rather store the record for one lamp in the instantiated object. However, this should be easy to adapt, and with numpy I don't have to reinvent the wheel. Only a Python noob could overlook this package ;-) – Axial Jan 19 '11 at 21:38
  • It's worth knowing that a numpy bool is stored as a whole byte, so this might not be as memory-efficient as it seems. – Scott Griffiths Jan 20 '11 at 00:05

I'd probably keep the lamps in a dictionary and have each of them contain a list of state changes, where the first element is the time of the change and the second is the state that is valid from that time on.

This way, when the next sample arrives, you do nothing unless the state changed compared to the last item.

Searching is quick and efficient, as you can use binary search on the change times.
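
A minimal sketch of that lookup, assuming parallel lists of change dates and states (all names here are illustrative):

import bisect
import datetime

change_dates = [datetime.date(2010, 1, 1),   # lamp went up
                datetime.date(2010, 3, 10),  # went down
                datetime.date(2010, 3, 14)]  # up again
change_state = [True, False, True]

def state_on(day):
    """Return the state valid on `day` (the last change at or before it)."""
    i = bisect.bisect_right(change_dates, day) - 1
    return change_state[i] if i >= 0 else None

print(state_on(datetime.date(2010, 3, 12)))  # False -- lamp was down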

Persisting it is also easy, and you can append data to an existing, running system without any problems. You can also share identical state lists between lamps through a dictionary to further reduce resource usage.

If you want to search for a gap, you just walk each list and compare consecutive change times. If you share the state lists as above, you only need to do this once for every distinct list rather than for every lamp, and you can then collect all the lamps that had the same "offline" stretches in a single pass, which may sometimes help.
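
Continuing the sketch above, a gap search over such a change list could look like this (again just a sketch under the same assumed names):

def gaps(change_dates, change_state, min_days=3):
    """Yield (first_day, last_day) of 'down' stretches of >= min_days.

    Note: an open-ended trailing 'down' state is not reported here.
    """
    for i, state in enumerate(change_state[:-1]):
        if not state:
            length = (change_dates[i + 1] - change_dates[i]).days
            if length >= min_days:
                yield (change_dates[i],
                       change_dates[i + 1] - datetime.timedelta(days=1))

print(list(gaps(change_dates, change_state)))
# [(datetime.date(2010, 3, 10), datetime.date(2010, 3, 13))]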

– RnR
  • Thanks! I like that this solution would be easy to extend. The record could be stored nicely. Still, I'll have to write a bit of scaffolding which might already exist in pyland (maybe some scientific data crunching module). – Axial Jan 19 '11 at 21:52