12

For one of my projects, I have to enter a big-ish collection of events into a database for later processing and I am trying to decide which DBMS would be best for my purpose.

I have:

  • About 400,000,000 discrete events at the moment

  • About 600 GB of data that will be stored in the DB

These events come in a variety of formats, but I estimate the number of distinct attributes to be about 5,000. Most events only contain values for about 100 attributes each. The attribute values are to be treated as arbitrary strings and, in some cases, integers.

The events will eventually be consolidated into a single time series. While they do have some internal structure, there are no references to other events, which - I believe - means that I don't need an object DB or some ORM system.
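
For illustration, a single event might look roughly like the following Python dictionary; the attribute names, values and the timestamp are made up here, not taken from my actual data:

# A made-up example event; real events carry roughly 100 of the ~5,000
# possible attributes, with values that are arbitrary strings or integers.
event = {
    "timestamp": 1297375647,      # hypothetical Unix epoch timestamp
    "source": "sensor-17",        # arbitrary string attribute
    "status": "completed",        # another string attribute
    "error_code": 404,            # integer attribute
    # ... the remaining ~96 attributes present in this particular event
}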

My requirements:

  • Open source license - I may have to tweak it a bit.

  • Scalability by being able to expand to multiple servers, although only one system will be used at first.

  • Fast queries - updates are not that critical.

  • Mature drivers/bindings for C/C++, Java and Python. Preferably with a license that plays well with others - I'd rather not commit myself to anything because of a technical decision. I think that most DB drivers do not have a problem here, but it should be mentioned anyway.

  • Availability for Linux.

  • It would be nice, but not necessary, if it were also available for Windows.

My ideal DB for this would allow me to retrieve all the events from a specified time period with a single query.

What I have found/considered so far:

  • PostgreSQL with an increased page size can apparently have up to 6,000 columns in each table. If my estimate of the attribute count is not off, it might do.

  • MySQL seems to have a limit of 4,000 columns per table. I could use multiple tables with a bit of SQL-fu, but I'd rather not.

  • MongoDB is what I am currently leaning towards. It would allow me to preserve the internal structure of the events while still being able to query them, and its API seems quite straightforward - a rough sketch of the kind of time-range query I have in mind follows this list. I have no idea how well it does performance-wise though - at least on a single server.

  • OpenTSDB and its metric collection framework sound interesting. I could use a single time series for each attribute (which might help with some of my processing), have the attribute value as a tag, and additionally tag the entries to associate them with a specific event. It probably has a steeper learning curve than the three options above, both from an administrator's and an application programmer's point of view. No idea about its performance.

  • Use HBase directly. This might fit my requirements better than OpenTSDB, although - judging from my past experience with Hadoop - the administration overhead is probably still higher than for the first three options.

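To make the MongoDB idea concrete, here is a rough pymongo sketch of how I imagine storing an event and running the time-range query mentioned above; the database, collection and field names are made up for illustration:

# Rough sketch of how I imagine using MongoDB via pymongo; the database,
# collection and field names below are only illustrative.
from datetime import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
events = client.mydb.events

# Each event becomes one document; attributes an event lacks are simply absent.
events.insert_one({
    "timestamp": datetime(2011, 2, 10, 21, 15),
    "source": "sensor-17",     # arbitrary string attribute
    "error_code": 404,         # integer attribute
})

# An index on the timestamp should keep the range query below fast.
events.create_index([("timestamp", ASCENDING)])

# "All events from a specified time period" as a single query:
start = datetime(2011, 2, 1)
end = datetime(2011, 2, 11)
for event in events.find({"timestamp": {"$gte": start, "$lt": end}}):
    print(event)
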
There are probably other databases that could do it, so feel free to let me know - I would appreciate any suggestion or comment that might help me with this.

PS: I only have minimal experience as a DB administrator, so I apologise for any misconceptions.

thkala
  • Most (all?) SQL database management systems also have a limit on the number of bytes in a row. Depending on the particular dbms, it might be a hard limit (can't create a table that might store more than 8k bytes in a row) or a soft limit (some columns might be moved to an alternate storage location within the db, which affects performance). – Mike Sherrill 'Cat Recall' Feb 10 '11 at 22:27

2 Answers

6

Using tables with thousands of columns is madness, especially when most of them will be empty for any given event, as you said.

You should first look into converting your data structure from this:

table_1
-------
event_id
attribute_1
attribute_2
[...]
attribute_5000

into something like this:

table_1          event_values             attributes
--------         ------------             ----------
event_id         event_id                 attribute_id
                 attribute_id             attribute_type
                 attribute_value

which can be used with any RDBMS (your only constraints then would be the total database size and performance).
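
Here is a minimal sketch of that layout using Python's built-in sqlite3, just to keep it self-contained; any RDBMS would do. The event_time column is my own addition, since you said you want to pull out whole time ranges:

# Minimal, self-contained sketch of the proposed layout using Python's
# built-in sqlite3. The event_time column is an addition of mine so that
# a time-range query is possible; names are only illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (
        event_id   INTEGER PRIMARY KEY,
        event_time INTEGER NOT NULL              -- e.g. Unix epoch seconds
    );
    CREATE TABLE attributes (
        attribute_id   INTEGER PRIMARY KEY,
        attribute_type TEXT NOT NULL             -- 'string' or 'integer'
    );
    CREATE TABLE event_values (
        event_id        INTEGER REFERENCES table_1(event_id),
        attribute_id    INTEGER REFERENCES attributes(attribute_id),
        attribute_value TEXT,
        PRIMARY KEY (event_id, attribute_id)
    );
    CREATE INDEX idx_event_time ON table_1(event_time);
""")

# All attribute values for every event in a given time window, in one query:
rows = conn.execute("""
    SELECT e.event_id, v.attribute_id, v.attribute_value
    FROM table_1 e
    JOIN event_values v ON v.event_id = e.event_id
    WHERE e.event_time BETWEEN ? AND ?
""", (1297372800, 1297376400)).fetchall()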

grimmig
  • I ended up using MongoDB for a variety of reasons, performance and ease of use being the most important. In any case, the schema you proposed is a basic entity-attribute-value (EAV) pattern that should be usable with any relational DB, which is why I will accept this answer. – thkala Oct 29 '11 at 12:12
0

It is probably very late for an answer, but here is what I do.

I use HDF5 as my time series repository. It offers a number of effective and fast compression filters which can be mixed and matched. It can be used from a number of different programming languages, and it is available on Windows as well as Linux.

I use boost::date_time for the timestamp field, which allows a large variety of datetime-based computations.

In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...

I created a number of custom iterators and use Standard Template Library algorithms to search efficiently for specific values or ranges of time-based records. The selections can then be loaded into memory.
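
The same idea can be sketched in Python with h5py (my real code uses the C++ API); the dataset names and the gzip compression choice here are only illustrative:

# Sketch of the HDF5 time-series idea using Python's h5py; dataset names,
# sample values and compression settings are made up for illustration.
import numpy as np
import h5py

with h5py.File("events.h5", "w") as f:
    # One compressed dataset for the (sorted) timestamps, one per value column.
    ts = np.array([1297375200, 1297375260, 1297375320], dtype="int64")
    px = np.array([101.5, 101.7, 101.6], dtype="float64")
    f.create_dataset("trades/timestamp", data=ts, compression="gzip")
    f.create_dataset("trades/price", data=px, compression="gzip")

with h5py.File("events.h5", "r") as f:
    t = f["trades/timestamp"][...]
    # Binary search for a time range, then load only that slice into memory.
    lo, hi = np.searchsorted(t, [1297375200, 1297375300])
    prices = f["trades/price"][lo:hi]
    print(prices)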