
There are several SO questions about time series databases, but none that address my specific concerns, and although this one comes closest, it's 3 years old.

Requirements:

  1. Multiple datasets. It doesn't matter how they're organized (separate tables, databases, processes, files, etc.).
  2. Single host operation (at least initially), so we're limited to approximately 1TB disk and 10GB RAM.
  3. Read latency/throughput are the key performance metrics.

Data behavior:

  1. Datasets are append-only, and records are immutable.
  2. Every record (independent of dataset) needs to be timestamped.
  3. Records will be 32-bit or 64-bit integers in "simple" datasets, while more "complex" datasets will be vectors of integers between 32-bit and 256-bit each, not exceeding about 1 KB per entry.
  4. There will be one primary "large" table, holding 200M or more entries of a "complex" (see previous point) nature.
  5. There will be many (10 < N < 100) small(er) datasets (both "simple" and "complex") with perhaps on-the-order-of millions of records each.

Wishlist:

  1. Starting off on a single host, we really want to avoid complex "Big Data"-y dependencies for the backend (such as HBase), though simpler alternatives would be considered. This takes e.g. OpenTSDB off the table.
  2. Friendly bindings in a high-ish level language. Ruby, Python, PHP, etc. but we can go down to C, C++, Java, etc. if we can't avoid it.
  3. Streaming/pubsub/realtime API preferable.
  4. Custom queries - we'll need more than just simple statistical mean/median/mode/std-dev operations, and it would be great if we can codify our analysis into "native" queries/commands/structures rather than read out all of the data just to calculate everything in the application code.

OpenTSDB is based on HBase, TempoDB won't work on a cost/performance basis, and Redis, Mongo, CouchDB, etc. all seem like they would choke on this volume of data, so we're left wondering if we're dreaming. Correct me if I'm underestimating any of the mentioned systems (or their contemporaries). Does something like this exist? If not, would we be able to get the job done by yielding on just one of the listed requirements or wishes?

Chris Tonkinson
  • See http://www.erol.si/2015/01/the-complete-list-of-all-timeseries-databases-for-your-iot-project/ for an overview of all available time series databases – koppor Jun 12 '15 at 13:54

3 Answers


I wrote an immutable database for time series data in Python, using HDF5 files as the underlying storage.

All of this is probably not super fast, but you may get the idea from this code fragment:

    def write_series(self, group, name, series):
        """Append only the strictly newer points of a pandas Series to the
        HDF5 table for (group, name); existing rows are never modified."""
        assert group in self.groups

        if name not in self.series(group):
            self.__create_table(group, name)

        table_hdf = self.__group__(group, name)

        # Timestamps already stored in the table.
        times = [row["time"] for row in table_hdf]

        if not times:
            add = series.index
        else:
            # Keep only points newer than anything already written.
            add = series.index[series.index > max(times)]

        if len(add) > 0:
            add = sorted(add)
            # zip replaces itertools.izip from the original Python 2 code.
            table_hdf.append(list(zip(add, series[add])))
            table_hdf.flush()

All of this is now also supported straight out of the box in pandas. My code is located here:

https://github.com/tschm/pycta
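The append-only guard in the fragment above (write only points strictly newer than the stored maximum) can be sketched with plain pandas, independent of the HDF5 layer. This is a minimal sketch; the function name and sample data are illustrative, not part of the library:

```python
import pandas as pd

def new_points(stored, incoming):
    """Return only the rows of `incoming` whose timestamps are strictly
    newer than anything already in `stored` (append-only, immutable)."""
    if stored.empty:
        return incoming.sort_index()
    cutoff = stored.index.max()
    return incoming[incoming.index > cutoff].sort_index()

stored = pd.Series([1.0, 2.0],
                   index=pd.to_datetime(["2013-01-01", "2013-01-02"]))
incoming = pd.Series([2.5, 3.0],
                     index=pd.to_datetime(["2013-01-02", "2013-01-03"]))

# The overlapping 2013-01-02 point is dropped; only 2013-01-03 survives.
print(new_points(stored, incoming))
```

Because records are immutable, re-sending an overlapping batch is harmless: duplicates at or before the stored maximum are simply filtered out.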

There is also an interesting little book, though I haven't read it yet:

http://www.amazon.co.uk/Python-HDF5-Andrew-Collette/dp/1449367836/ref=sr_1_1?ie=UTF8&qid=1387485396&sr=8-1&keywords=Python+hdf

Happy storing data, Thomas

tschm

Have you tried SciDB? It is designed for processing large-scale scientific data. MonetDB's SciQL also claims to support similar functionality, but I haven't used MonetDB.

In your case, all you need in SciDB is called "window aggregation", which allows a sliding window to move along a time dimension and some aggregate statistics to be calculated for each window snapshot. The reasons why SciDB may be attractive to you are as follows:

  1. It is very easy to install a single-host version. It is already available on EC2, if you don't want any trouble with setup.

  2. SciDB mainly supports two interfaces: AFL and AQL. The former is a functional language, and the latter is an SQL-like language. Both are very high-level and declarative. Moreover, SciDB also has SciDB-R, which provides bindings for the R language.

  3. SciDB does support user-defined functions, so you can customize your ad-hoc aggregation functions.

  4. SciDB is an open-source software, so it's totally free.
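To make the "window aggregation" idea above concrete: it is a sliding window over the time dimension with an aggregate computed per window position. For illustration only (this is pandas, not SciDB syntax), the equivalent computation looks like this; the series and names are made up:

```python
import pandas as pd

# Hypothetical minute-stamped series; values and names are illustrative.
idx = pd.date_range("2013-12-20", periods=10, freq="min")
s = pd.Series(range(10), index=idx, dtype=float)

# A sliding window of 3 observations with a per-window mean, analogous
# to a window aggregate moving along SciDB's time dimension.
rolling_mean = s.rolling(window=3).mean()
print(rolling_mean)
```

In SciDB the same shape of computation runs inside the database, so you avoid reading all 200M rows into application code just to aggregate them.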

wayi
  • BTW, I have developed a similar parallel program for processing HDF5 data sets. It is written in C++, and it runs faster than SciDB, but unfortunately no fault tolerance is supported so far. – wayi Dec 20 '13 at 16:07

Wishlist:

  1. simple setup: check
  2. bindings in a high-ish level language: check (http://code.kx.com/wiki/Category:Interfaces)
  3. Streaming/pubsub/realtime: check
  4. Custom queries: check (SQL like query language)

=> kdb+ from http://kx.com is what you are looking for.

hellomichibye