
I am currently trying to improve the performance of a web application. The goal of the application is to provide (real-time) analytics. We have a database model that is similar to a star schema: a few fact tables and many dimension tables. The database runs on MySQL with the MyISAM engine.
Fact tables can easily grow into the upper millions of rows, and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow when the dimension tables are joined onto the fact tables and aggregations are performed on top. The first thing that comes to mind is: why not precalculate the data? That is not possible, because users are allowed to apply several freely customizable filters.
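For reference, the slow queries look roughly like this (all table and column names below are made up for illustration):

```sql
-- Hypothetical star-schema query: aggregate over the fact table while
-- joining dimension tables, with user-chosen filters on both.
SELECT d.region, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_customer c ON c.customer_id = f.customer_id
JOIN dim_region   d ON d.region_id   = c.region_id
WHERE c.segment = 'retail'          -- freely customizable filter
  AND f.sale_date >= '2011-01-01'   -- freely customizable filter
GROUP BY d.region;
```

Because the filters can touch any dimension attribute, there is no fixed set of aggregates we could precalculate.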

So what I need is an all-in-one system suitable for every purpose ;) Sadly, it hasn't been invented yet. So I came up with the idea of combining two existing systems: mixing a row-oriented and a column-oriented database (e.g. InfiniDB or Infobright). I would keep the MySQL/MyISAM solution (for fast inserts and row-based queries) and add a column-oriented database (for fast aggregation operations on a few columns), filling it periodically (nightly) via a cron job. The problem arises when the current data (which must be real time) is queried: I might then need to fetch data from both databases, which can complicate things.
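The "current data" case could in principle be handled by merging partial results from both stores. A rough sketch, assuming made-up schema names and either a federation layer or application-side merging (a single SQL statement cannot normally span two separate database engines):

```sql
-- Sketch: historical rows come from the column store (loaded nightly),
-- today's rows from the MyISAM fact table; the two partial result sets
-- are combined and re-aggregated.
SELECT region, SUM(revenue) AS total_revenue FROM (
    SELECT region, revenue FROM columnstore.fact_sales
    WHERE sale_date < CURDATE()       -- everything up to last night's load
    UNION ALL
    SELECT region, revenue FROM myisam.fact_sales
    WHERE sale_date >= CURDATE()      -- today's real-time rows
) AS combined
GROUP BY region;
```

In practice the two inner queries would run against the two systems separately and the application would merge the results.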

First tests with InfiniDB showed really good performance on aggregations over a few columns, so I really think this could help speed up the application.

So the question is: is this a good idea? Has somebody already done this? Or maybe there are better ways to do it.

I have no experience with column-oriented databases yet, and I'm also not sure what their schema should look like. First tests showed good performance both on the same star-schema-like structure and on a single big flat-table structure.

I hope this question fits on SO.

enricog
  • Just change your engine to innodb http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html. I'd probably export the data into csv files sorted by primary key, recreate the schema using innodb and then reload the sorted data. – Jon Black Feb 25 '11 at 11:31
  • Thanks, yes, we are also considering changing to InnoDB, especially because of massive parallel reads/writes. I also tested a bit with InnoDB, which gave good results, especially on concurrent reads/writes. But it did not give the needed performance boost, unlike the column-oriented databases, which performed about 25+% better on some operations. – enricog Feb 25 '11 at 11:49
  • Strange - I have observed the complete opposite. Maybe you need to redesign your schema to take advantage of InnoDB's clustered index http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/ – Jon Black Feb 25 '11 at 11:57
  • I didn't test that deeply yet (and also not with a clustered index in mind), so I will surely have a deeper look into that now. That link also looks good, so thanks again. In my tests the column-based databases were superior mainly on column-based operations (which surely makes sense) like SELECT DISTINCT column, SUM(column), COUNT, etc. On operations selecting xxx-thousands of rows with 6 out of 10 columns from a table they surely were slower. That's why I was thinking about combining the two. – enricog Feb 25 '11 at 12:04

1 Answer


Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with highly customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and QuickLZ-compressed, and all historical years column-oriented and bz2-compressed.
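A sketch of what that per-partition mixing looks like in Greenplum's DDL (table and partition names are made up, and the exact storage options available depend on your Greenplum version; zlib stands in here for the stronger compression mentioned above):

```sql
-- Hypothetical Greenplum table mixing storage settings per partition.
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    -- most recent year: row-oriented heap, uncompressed
    PARTITION y2011 START (date '2011-01-01') END (date '2012-01-01')
        WITH (appendonly=false),
    -- prior year: column-oriented, QuickLZ
    PARTITION y2010 START (date '2010-01-01') END (date '2011-01-01')
        WITH (appendonly=true, orientation=column, compresstype=quicklz),
    -- all historical years: column-oriented, stronger compression
    DEFAULT PARTITION history
        WITH (appendonly=true, orientation=column, compresstype=zlib)
);
```

Queries see a single logical table; the planner routes each partition to its own storage format.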

Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.

(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)

As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.

In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.
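In concrete terms, that means collapsing the star into one wide table, something like this (made-up names, intended only to illustrate the shape):

```sql
-- Denormalized, column-oriented fact table: dimension attributes are
-- repeated on every row instead of being joined in at query time.
-- Column-store compression makes the repetition nearly free on disk.
CREATE TABLE sales_flat (
    sale_date     date,
    revenue       numeric,
    customer_name varchar(100),   -- repeated on every one of that customer's rows
    customer_city varchar(100),
    product_name  varchar(100),
    product_group varchar(100)
);

-- Aggregations then need no joins at all:
SELECT product_group, SUM(revenue)
FROM sales_flat
WHERE customer_city = 'Berlin'
GROUP BY product_group;
```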

goodside