
I need to store a bunch of time series in a database but I'm concerned about both the size and the processing time.

To reduce size I've already used, in another project, zipped/JSON to store a whole time series, and this is quite efficient in terms of storage space. But the issue is that to search for some data you first have to retrieve the whole time series, unzip it and deserialize it, and of course you can't use the database's built-in querying capabilities like SQL SELECT/WHERE.

So you consume bandwidth to fetch the data, CPU to unzip it and RAM to hold it, even if you need only one point...
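
To make that concrete, here is a minimal Python sketch of the zipped/JSON approach (the data and names are purely illustrative):

```python
import gzip
import json

# Illustrative only: a whole series stored as one gzipped JSON blob.
series = {"2013-06-10": 1.23, "2013-06-11": 1.31}  # date -> value
blob = gzip.compress(json.dumps(series).encode("utf-8"))  # what goes into the DB

# To read even a single point you must fetch, unzip and parse everything:
restored = json.loads(gzip.decompress(blob).decode("utf-8"))
point = restored.get("2013-06-11")
```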

This was not an issue for the previous project because the time series were always manipulated as a whole, essentially to be displayed in charts or Excel, but this time I'd like to have at least a minimal ability to search the data in the database.

To allow this flexibility in terms of data manipulation, e.g. using SQL, there is the "standard" format: one row per date (a minimal sketch of this layout follows the list below), but I have two concerns:

  • a time series over 10 years could have around 3,000 values, which means 3,000 rows, so with 1M time series I could end up with 3 billion rows! I'm not sure a "normal" database like MySQL or PostgreSQL can handle such a huge number of rows, but I hope I'm wrong
  • I don't know how good DBMSs are at optimizing the space required by all these cells, though as long as it's not "too" big it's OK
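
Here is a minimal sketch of that one-row-per-date layout (SQLite is used purely for illustration and the names are mine; the same schema applies to MySQL/PostgreSQL):

```python
import sqlite3

# Minimal sketch of the "one row per date" layout.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observation (
        series_id INTEGER NOT NULL,
        obs_date  TEXT    NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (series_id, obs_date)  -- composite index used by lookups
    )
""")
conn.execute("INSERT INTO observation VALUES (?, ?, ?)", (42, "2013-06-10", 1.23))

# A single point (or a date range) is now a plain indexed SELECT:
row = conn.execute(
    "SELECT value FROM observation WHERE series_id = ? AND obs_date = ?",
    (42, "2013-06-10"),
).fetchone()
```

The composite primary key is what keeps a single-point or range SELECT from scanning the whole table.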

I can choose any free database, so NoSQL is welcome too if it can help.

Do you have any suggestions, or better, some feedback?

Thanks for any input.

Pragmateek
  • 3 billion rows is large but not extreme. With appropriate indexes you should get perfectly reasonable performance. But you won't fit it into 20GB (that's only just over 6 bytes per datapoint). Why is 20GB your limit for a database table? – MatBailie Jun 10 '13 at 17:49
  • @Dems: glad it's not extreme in your opinion, this is good news, but I've never worked with more than 1M rows. :) 20GB is quite an arbitrary limitation because some hosted databases are really expensive but I've seen more reasonable plans. I'll remove it. Thanks :) – Pragmateek Jun 10 '13 at 18:12
  • FYI, just have a look at the cool Heroku plans: https://postgres.heroku.com/pricing They are not limited in database size (1TB seems like infinity for my needs ;) ) but in cache size, and I don't know if a reasonable 800MB would be sufficient. But what is good with the cloud is that you can change the plan smoothly as the needs evolve. – Pragmateek Jun 10 '13 at 18:15
  • @Hazzit: well, almost, but it's more than two years old and the database landscape has evolved. *PerformanceDBA*'s great answer is in accordance with what *Dems* just said. +1 Seems like I'll go the PostgreSQL Heroku way and take care of the indexes. Thanks – Pragmateek Jun 10 '13 at 19:29
  • 1
    If I understood your situation, it is unlikely that the same query will be run multiple times in succession. If that is true, caching will not have much effect on performance. Also, please note that at 100 USD/month, you have a ton of options both as Platform as a Service (cloud solutions) and self-managed (your own server). – Hazzit Jun 11 '13 at 13:47
  • @Hazzit: thanks for the information, I'll try it and should quickly see if it fits my needs... – Pragmateek Jun 11 '13 at 13:59

1 Answer


Check out TempoDB: http://tempo-db.com

I'm a co-founder, and we built the service to solve this exact problem.

The access pattern is writing data in order by time, usually not editing it (highly immutable), and then reading data back by time.

The fundamental issue you'll face is indexing on a timestamp when there are many billions of rows. You want to decouple query performance from the underlying total dataset size, which will always be growing at least linearly. We do all that stuff... and more :)
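
To make that concern concrete, here is a plain-SQL sketch of the access pattern (not how TempoDB works internally; the table and column names are just illustrative):

```python
import sqlite3

# Writes arrive in time order; reads are almost always "one series over a
# time range". That range read is the query that has to stay cheap no
# matter how many billions of rows the table holds overall.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observation (
        series_id INTEGER NOT NULL,
        ts        TEXT    NOT NULL,
        value     REAL    NOT NULL,
        PRIMARY KEY (series_id, ts)  -- composite index drives the range read
    )
""")
conn.execute("INSERT INTO observation VALUES (?, ?, ?)", (42, "2013-06-10T00:00:00", 1.23))

rows = conn.execute(
    "SELECT ts, value FROM observation "
    "WHERE series_id = ? AND ts BETWEEN ? AND ? ORDER BY ts",
    (42, "2013-01-01", "2013-06-30"),
).fetchall()
```

Keeping that read proportional to the size of the requested range, rather than to the total row count, is exactly the decoupling mentioned above.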

Andrew C
  • I've quickly glanced at your offer and it seems interesting. But how would it compare to a service like Heroku Postgres? We don't need additional services like data visualization, just a raw data store with decent storage space and performance, and we don't have very strict availability constraints. Thanks – Pragmateek Jun 12 '13 at 23:32
  • Both TempoDB and Heroku Postgres are hosted services, so they are similar in that respect. The difference is that Postgres is a general purpose relational database, and TempoDB is purpose-built for time series. So common operations on time series data, like rollups and shifting timezones, are built into TempoDB: https://tempo-db.com/docs/api/read/#read-rollups – Andrew C Jun 13 '13 at 15:50
  • 1
    I've bookmarked it, may give it a try, and even benchmark it against Heroku Postgres. :) – Pragmateek Jun 13 '13 at 16:09