22

I am scoping out a project with large, mostly incompressible time series data, and I am wondering whether Django + Postgres with raw SQL is the right call.

I have time series data arriving at roughly 2K objects per hour, every hour. That is about 2 million rows stored per year, and I would like to 1) be able to slice off data for analysis through a connection, and 2) be able to do elementary overview work on the web, served by Django. I think the best idea is to use Django for the objects themselves but drop to raw SQL to deal with the associated large time series data. I see this as a hybrid approach; that might be a red flag, but using the full ORM for a long series of data samples feels like overkill. Is there a better way?

Ben

4 Answers

31

If I understand your thoughts correctly, you are considering storing the time series in PostgreSQL, one time series record in one database row. Don't do that.

On the one hand, the problem is theoretical. Relational databases (and I think most databases) are based on the premise of row independence, whereas the records of a time series are physically ordered. Of course, database indexes provide some order for database tables, but that order is meant to speed up searching or to present results alphabetically or in some other order; it does not give the order any natural meaning. Regardless of how you order them, each customer is independent of other customers, and each customer's purchase is independent of his other purchases, even if you can retrieve them all chronologically in order to form the customer's purchase history. The interdependence of time series records is much stronger, which makes relational databases inappropriate.

In practice, this means that the disk space taken up by the table and its indexes will be huge (maybe 20 times larger than storing the time series in files), and reading time series from the database will be very slow, something like an order of magnitude slower than reading from files. It will also not give you any important benefit. You probably aren't ever going to make the query "give me all time series records whose value is larger than X". If you ever need such a query, you will also need a hell of a lot of other analysis which the relational database has not been designed to perform, so you will read the entire time series into some object anyway.

So each time series should be stored as a file. It might be either a file on the file system, or a blob in the database. Despite the fact that I've implemented the latter, I believe the former is better; in Django, I'd write something like this:

class Timeseries(models.Model):
    # Metadata lives in ordinary ORM fields; the samples themselves live in a file.
    name = models.CharField(max_length=50)
    time_step = models.ForeignKey(...)
    other_metadata = models.Whatever(...)
    data = models.FileField(...)

Using a FileField will make your database smaller and make it easier to take incremental backups of your system. It will also be easier to get slices by seeking in the file, something that's probably impossible or difficult with a blob.
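
For example, if each record were a fixed-width binary value (an assumption; the actual file layout is up to you), a slice could be read without loading the whole file, roughly like this:

import struct

RECORD_SIZE = 8  # hypothetical layout: one little-endian float64 per hourly sample

def read_slice(ts, start_index, count):
    # ts is a Timeseries instance as defined above; seek to the first wanted
    # record and read only the requested range from the FileField.
    with ts.data.open("rb") as f:
        f.seek(start_index * RECORD_SIZE)
        raw = f.read(count * RECORD_SIZE)
    return struct.unpack("<%dd" % (len(raw) // RECORD_SIZE), raw)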

Now, what kind of file? I'd advise you to take a look at pandas. It's a Python library for data analysis that has support for time series, and it also has ways to store time series in files.
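
As a rough sketch (the file name, frequency and values are made up), storing an hourly series with pandas and slicing it back out could look like this:

import pandas as pd

# An hourly series for illustration; real data would come from your source.
index = pd.date_range("2014-01-01", periods=24, freq="H")
series = pd.Series(range(24), index=index)

# Pickle preserves the index and dtype; CSV or HDF5 are alternatives.
series.to_pickle("timeseries.pkl")
restored = pd.read_pickle("timeseries.pkl")

# Slicing by timestamp is then straightforward.
morning = restored["2014-01-01 06:00":"2014-01-01 12:00"]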

I linked above to a library of mine which I don't recommend you use; on the one hand it doesn't do what you want (it can't handle granularity finer than a minute, and it has other shortcomings), and on the other it's outdated - I wrote it before pandas existed, and I intend to convert it to use pandas in the future. There's a book, "Python for Data Analysis", by the author of pandas, which I've found invaluable.

Update (2016): There's also InfluxDB. I've never used it and therefore have no opinion, but it is definitely something you need to examine if you are wondering how to store time series.

Update (2020-02-07): There's also TimescaleDB, an extension to PostgreSQL.

Update (2020-08-07): We changed our software (again) so that it stores the data in the database using TimescaleDB. We are already well versed in PostgreSQL and it was easy to learn some TimescaleDB. The most important concrete advantage is that we can make queries like "find all locations where there was >50 mm of rain within 24 hours in 2019", something that would be very difficult when storing data in flat files. Another advantage is the integrity checks: over the years we had a few time series with duplicate rows because of little bugs here and there. The drawbacks are also significant: it uses 10 times more disk space, and we may need to change our PostgreSQL backup policy because of that. It's also slower: it takes maybe one second to retrieve a time series with 300k records, which used to be instant, so we needed to implement caching for retrieving time series, which wasn't needed before.
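
As an illustration of the kind of query this enables, here is a sketch using TimescaleDB's time_bucket through Django's raw connection; the table and column names (measurements, ts, location_id, rainfall_mm) are hypothetical, and fixed daily buckets only approximate "within any 24-hour window":

from django.db import connection

# Hypothetical schema: measurements(ts timestamptz, location_id int, rainfall_mm real)
RAINY_LOCATIONS_SQL = """
SELECT location_id,
       time_bucket('24 hours', ts) AS bucket,
       SUM(rainfall_mm) AS total_mm
FROM measurements
WHERE ts >= '2019-01-01' AND ts < '2020-01-01'
GROUP BY location_id, bucket
HAVING SUM(rainfall_mm) > 50
ORDER BY total_mm DESC;
"""

def rainy_locations():
    # Raw SQL through Django's connection; no ORM involvement for the bulk query.
    with connection.cursor() as cursor:
        cursor.execute(RAINY_LOCATIONS_SQL)
        return cursor.fetchall()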

Antonis Christofides
  • I am not sure files will address my use case. I am looking at weather data, so I take slices of 1) all history for a few places, 2) all places for a shorter slice of history (<1 month), and 3) max values across all locations for a time period. (Because weather is correlated in time and place, the different places can meaningfully inform each other.) If I use a file implementation that is either place-major (each place gets a file) or time-major (each day/week/month gets a file), I will have to touch all the files to pull the sorts of slices above. Is a DB definitely unworkable? – Ben Aug 10 '14 at 17:26
  • I misunderstood you; I thought you had one time series with 2k rows per hour; now I understand you have hourly time series at 2k locations. My opinion doesn't change, however. No, a relational database is not definitely unworkable, and I'm sure successful applications have been written with it. However I think it is suboptimal. It could be appropriate for you, however. I see that pandas has functionality to read and write data to and from a database. – Antonis Christofides Aug 11 '14 at 11:39
  • Would you suggest the same for open-high-low-close data? I was looking into time series, but storing it as a pandas data frame would make everything much easier for me. – Aran Freel Aug 06 '20 at 07:35
  • I don't really know. I have some code that reads from the database to pandas and vice versa, and from files to pandas and vice versa. – Antonis Christofides Aug 07 '20 at 09:18
  • Take a look also at [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics). There is a high probability that it will show better performance with lower resource usage for your workload. – valyala Dec 05 '20 at 21:34
12

Time series databases seem to be one of those things that keep getting reinvented, and as suggested above, relational databases are not a good fit.

What I did was combine Django with InfluxDB, which is built for time series data. It's fantastic to use, and the python client libraries can work with pandas dataframes. That means you can either use the InfluxDB query language to play with the data in situ, or pull all of it (aggregated, if need be) for analysis within python. My app is handling data streams at a similar volume to what you need.
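
A minimal sketch, assuming the influxdb-python DataFrameClient and a hypothetical "temperature" measurement, of pulling aggregated data straight into pandas:

from influxdb import DataFrameClient

client = DataFrameClient(host="localhost", port=8086, database="weather")

# InfluxQL does the aggregation in situ; only the daily means come back.
result = client.query(
    "SELECT MEAN(value) FROM temperature WHERE time > now() - 30d GROUP BY time(1d)"
)
daily_means = result["temperature"]  # a pandas DataFrame indexed by time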

I link InfluxDB series names with the Django app/model/primary key as needed. Time series data goes in the linked InfluxDB series; miscellaneous slowly-changing or relational data goes into Django ORM fields.
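
A minimal sketch of how that linkage could look, assuming the influxdb-python client and hypothetical names (a Station model, a "samples" measurement):

from django.db import models
from influxdb import InfluxDBClient

class Station(models.Model):
    # Slowly-changing metadata stays in the ORM.
    name = models.CharField(max_length=50)

client = InfluxDBClient(host="localhost", port=8086, database="weather")

def write_sample(station, timestamp, value):
    # The model's primary key is used as a tag, tying each point to its Station row.
    client.write_points([{
        "measurement": "samples",
        "tags": {"station_id": str(station.pk)},
        "time": timestamp.isoformat(),
        "fields": {"value": float(value)},
    }])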

user2205880
5

It sounds like you want to look into TimescaleDB. I have not used it myself yet, but it is apparently an extension of PostgreSQL, so I'd assume full Django support, and, as they say, it handles

100Ks of rows and millions of metrics per second, even at 100 billion rows on a single node.

nuts
0

You might also consider using the PostGIS PostgreSQL extension, which includes support for raster data types (basically large grids of numbers) and has many features for making use of them.

However, do not use the ORM in this case; you will want to run SQL directly on the server. The ORM will add a huge amount of overhead for large numerical datasets, and it's not well suited to handling large matrices within Python itself; for that you need NumPy.
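
A minimal sketch (hypothetical table and column names) of skipping the ORM and loading the numeric column straight into NumPy:

import numpy as np
from django.db import connection

def load_values(location_id):
    # Raw SQL through Django's existing connection; the ORM never sees the rows.
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT ts, value FROM samples WHERE location_id = %s ORDER BY ts",
            [location_id],
        )
        rows = cursor.fetchall()
    # Only the numeric column goes into the array; timestamps stay as Python objects.
    return np.array([r[1] for r in rows], dtype=float)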

mangecoeur