I am creating a Python desktop application that lets users select different distributional forms to model agricultural yield data. I have the time series agricultural data, close to a million rows, saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data and outputs 10,000 points drawn from that fitted distribution as a NumPy array. This data only needs to exist while the program is running.

To be efficient, I would like to perform the fit and the subsequent draw only once for a given region and distribution. I have been researching temporary files in Python, but I am not sure that is the best approach for saving many different NumPy arrays. PyTables also looks like an interesting approach and seems to be compatible with NumPy, but I am not sure it is good for handling temporary data. NoSQL solutions, like MongoDB, seem to be very popular these days as well, which also interests me from a resume-building perspective.

Edit: After reading the comment below and researching it, I am going to go with PyTables, but I am trying to find the best way to tackle this. Is it possible to create a table like the one below where, instead of Float32Col, I use createTimeSeriesTable() from the scikits time series library? Or do I need to create a datetime column for the date and a boolean column for the mask, in addition to the Float32Col below that holds the data? Or is there a better way to go about this problem?

from tables import IsDescription, UInt16Col, Float32Col

class Yield(IsDescription):
    geography_id = UInt16Col()  # identifier of the region
    data = Float32Col(shape=(50, 1))  # for 50 years of data

Any help on the matter would be greatly appreciated.

hotshotiguana

1 Answer

What's your use case for the temporary data? Are you just going to be reading it all in at once (and never wanting to just read in a subset)?

If so, just save the array to a temporary file (e.g. with numpy.save, or equivalently, pickle with a binary protocol). There's no need for fancier solutions in that case.
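As a minimal sketch of the temporary-file approach, assuming the samples are a plain one-dimensional float array: the file name, the normal parameters, and the use of `tempfile.TemporaryDirectory` are illustrative choices, not anything from the question.

```python
import os
import tempfile

import numpy as np

# Stand-in for the 10,000 draws from a fitted distribution
# (the mean and standard deviation here are made up for illustration).
rng = np.random.default_rng(0)
samples = rng.normal(loc=150.0, scale=20.0, size=10_000)

# The directory and everything in it are deleted when the block exits,
# so the saved array lives only as long as this scope.
with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical naming scheme: one .npy file per region/distribution pair.
    path = os.path.join(tmp_dir, "corn_illinois_normal.npy")
    np.save(path, samples)

    # np.load reads the whole array back into memory at once.
    restored = np.load(path)
    round_trip_ok = np.array_equal(samples, restored)
```

In a real application the temporary directory would probably be created once at startup and cleaned up at exit, rather than in a single `with` block.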

On a side note, I'd highly recommend PyTables over SQLite for storing your original time series data.

Based on what it sounds like you're doing, you're not going to need the "relational" parts of a relational database (e.g. joins). If you don't need to join or relate tables, you just need fast, simple queries, and you want the data in memory as a numpy array, PyTables is an excellent option. PyTables uses HDF5 to store your data, which can be much more compact on disk than a SQLite database. PyTables is also considerably faster for loading large chunks of data into memory as numpy arrays.

Joe Kington
  • I am probably going to be reading in subsets of the temporary data, for instance if a user selects more than one geography and runs the fit on the multiple geographies and then selects a couple other different geographies minus one of the first ones. I would like to first check the temporary data and if found, use that data, otherwise query the database and fit the new data. I looked into PyTables this afternoon and think it is a better option, but I have edited my original question to include a brief question on storing Scikits time series data in a table with other columns. – hotshotiguana Apr 10 '12 at 00:02
  • Well, keep in mind that drawing 10k random samples is pretty fast. It's not an answer, but in general I'd recommend not being "too clever". There's a good chance that generating the random samples is going to be faster than disk access. The slow part is likely to be fitting the distribution. You might think about just storing the distribution parameters (e.g. mean and stddev for a normal dist, etc), at which point you can get away with storing things in memory. Of course, all of that is pure speculation and doesn't get any closer to answering your question. – Joe Kington Apr 10 '12 at 02:26