
I am building a system to analyze large quantities of financial data regarding securities trading prices. A large challenge is determining which storage method to use, given that the data will be in the tens of terabytes. There will be many queries on the data, such as taking averages, calculating standard deviations, and computing sums filtered by multiple columns such as price, time, volume, etc. (an example of the kind of query is below). Join statements aren't a requirement, but would be nice to have.
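
For concreteness, a typical query would look roughly like this (the `trades` table and its columns are purely illustrative, not an actual schema):

```sql
-- Illustrative only: filtered aggregates over a hypothetical trades table.
SELECT
    AVG(price)         AS avg_price,
    STDDEV_SAMP(price) AS price_stddev,
    SUM(volume)        AS total_volume
FROM trades
WHERE trade_time >= '2012-03-01'
  AND trade_time <  '2012-04-01'
  AND price BETWEEN 10.00 AND 50.00
  AND volume > 1000;
```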

Right now, I am looking at Infobright Community Edition, MonetDB, and Greenplum Community Edition for evaluation purposes. They seem great so far, but some of the more advanced features I require are not available in these editions (using multiple servers, insert/update statements, etc.).

What solution would you use for this situation, and what benefits does it provide over the alternatives? Being cost effective is a major plus. If I must pay for a data warehousing solution I will, but I would much rather avoid it and take the open-source/community edition route if possible.

  • Actually (partial) sums, averages, stddevs - even when filtered - are not data mining. They're just plain **statitics**. – Has QUIT--Anony-Mousse Mar 24 '12 at 07:40
  • Data mining is the discovery of new information from a set of data. Statistics are a tool to aid in this process. I do not think such emphasis on terminology is productive, and arguing about it will not accomplish anything. It is like me saying I disagree with your post because you misspelled statistics as "statitics" - it's unnecessary emphasis on a technicality. My goal is to use this database to discover new information. Additionally, data warehousing solutions are built with data mining in mind. Therefore, the concept of data mining is applicable here. – user396404 Mar 24 '12 at 18:26
  • However, I do agree with you that "database" is probably a better tag for this post. – user396404 Mar 24 '12 at 19:29

2 Answers


Infobright delivers fast query performance with no tuning, no projections, and no indexes on large volumes of data. For data loading, I have seen instances where 80TB of data per hour can be loaded, at over 12,000 inserts per second.

How does it work?

  1. Column orientation vs. row orientation
  2. Data Packs plus compression averaging 20:1
  3. Knowledge Grid - sub-second query response
  4. Granular engine built on top of the MySQL architecture

I would still suggest you consider looking into the enterprise licensing, but you can certainly evaluate the community edition and test your performance and data loading needs against it.
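
For what it's worth, a minimal loading sketch for ICE might look like the following (table and file names are made up, and this assumes the usual ICE setup where tables use the BRIGHTHOUSE engine and are populated via LOAD DATA INFILE, since the community edition lacks INSERT/UPDATE statements):

```sql
-- Sketch only: hypothetical table and file; not a benchmarked configuration.
CREATE TABLE trades (
    symbol      VARCHAR(12),
    trade_time  DATETIME,
    price       DECIMAL(12,4),
    volume      BIGINT
) ENGINE=BRIGHTHOUSE;

-- Bulk load from a flat file; ICE is designed around this path
-- rather than row-by-row INSERT statements.
LOAD DATA INFILE '/data/trades_2012-03-23.csv'
INTO TABLE trades
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```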

Disclaimer: author is affiliated with Infobright.

Craig Trombly

I think any of the databases you mention will do what you require. If you're dealing with tens of TB of data, getting an enterprise license to run in an MPP cluster would probably be a good use of funds to keep processing time down. Also, if this DW will be doing important processing for your organization, having a license means you get support from the vendor, which is important to a lot of businesses. YMMV.

A more important question is what your data ingest rates are going to look like. For a financial system, I think a large part of the equation should be the ability to keep loading fresh data into your system while continuing to do your normal processing.

I am only familiar with Greenplum out of your list of candidates, but I know that it does well at loading large amounts of data in a short amount of time. GP also has a lot of built-in statistical and analytic functions you can run natively inside the DB, including standard SQL functions, MADlib, R, etc.
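
As a rough sketch of what that can look like (the host, port, file pattern, and table layout below are all assumptions, not a recommended setup), Greenplum can pull flat files in parallel through a gpfdist external table and then run the statistics in-database:

```sql
-- Sketch only: parallel load via a gpfdist external table, then an
-- in-database aggregate. Names and locations are hypothetical.
CREATE EXTERNAL TABLE trades_ext (
    symbol      text,
    trade_time  timestamp,
    price       numeric(12,4),
    volume      bigint
)
LOCATION ('gpfdist://etl-host:8081/trades_*.csv')
FORMAT 'CSV';

INSERT INTO trades SELECT * FROM trades_ext;

-- Per-symbol daily statistics computed where the data lives.
SELECT symbol,
       date_trunc('day', trade_time) AS trade_day,
       AVG(price)         AS avg_price,
       STDDEV_SAMP(price) AS price_stddev,
       SUM(volume)        AS total_volume
FROM trades
GROUP BY symbol, date_trunc('day', trade_time);
```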

Bart K