I have to perform some serious data mining on very large data sets stored in a MySQL database. However, queries that require anything more than a basic `SELECT * FROM X WHERE ...` tend to become rather inefficient, since they return result sets on the order of 10^6 rows or more, especially when a `JOIN` on one or more tables is introduced. Think of joining two or more tables containing several tens of millions of rows (after filtering the data), which is something that happens on pretty much every query. More often than not we'd like to run aggregate functions on these results (`SUM`, `AVG`, `COUNT`, etc.), but this is impossible because MySQL simply chokes.
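For concreteness, here is the rough shape of the queries in question (the table and column names are made up, but the structure is representative of what we run):

```sql
-- Hypothetical example of the query shape that chokes:
-- two multi-million-row tables joined after filtering, then aggregated.
SELECT u.country,
       COUNT(*)      AS order_count,
       SUM(o.amount) AS total_amount,
       AVG(o.amount) AS avg_amount
FROM   orders o
JOIN   users  u ON u.id = o.user_id
WHERE  o.created_at >= '2010-01-01'
GROUP BY u.country;
```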
I should note that a lot of effort has already gone into optimizing the current setup: all tables are properly indexed, the queries are tuned, the hardware is top notch, the storage engine is configured, and so on. Yet each query still takes a very long time, to the point of "let's run it before we go home and hope for the best when we come to work tomorrow." Not good.
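By "tuned" I mean, among other things, checking the execution plans with `EXPLAIN` to confirm the indexes are actually being used (again with made-up names):

```sql
-- Verify that the join and filter columns hit indexes rather than
-- triggering full table scans. The plans look fine; the queries are
-- still slow at this scale.
EXPLAIN
SELECT u.country, SUM(o.amount)
FROM   orders o
JOIN   users  u ON u.id = o.user_id
WHERE  o.created_at >= '2010-01-01'
GROUP BY u.country;
```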
This has to be a solvable problem: many large companies perform very data- and compute-intensive mining and handle it well (without writing their own storage engines; Google comes to mind). I'm willing to accept a time penalty to get the job done, but on the order of hours, not days. My question is: what do people use to counter problems like this? I've heard of storage engines geared toward this type of problem (Greenplum, etc.), but I wanted to hear how this problem is typically approached. Our current data store is obviously relational and should probably remain that way, but any thoughts or suggestions are welcome. Thanks.