I have TBs of structured data in a Greenplum DB. I need to run what is essentially a MapReduce job on my data.
I found myself reimplementing at least some of the features of MapReduce just so the data could be processed in memory, pulling it through in a streaming fashion rather than loading it all at once.
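For context, this is roughly the kind of thing I have been hand-rolling (table and column names here are made up): a streaming aggregation over a JDBC cursor against Greenplum, so the result set never has to be fully materialised:

```scala
import java.sql.DriverManager

object StreamingAggregate {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection details and schema, just to show the shape of it.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://gp-master:5432/mydb", "user", "pass")
    // Greenplum speaks the Postgres wire protocol; with autocommit off and a
    // fetch size set, the driver streams rows instead of buffering them all.
    conn.setAutoCommit(false)
    val stmt = conn.createStatement()
    stmt.setFetchSize(10000)

    val rs = stmt.executeQuery("SELECT group_key, value FROM big_table")
    // The "map" and "reduce" done by hand: accumulate partial sums per key.
    val sums = scala.collection.mutable.Map.empty[String, Double].withDefaultValue(0.0)
    while (rs.next()) {
      sums(rs.getString("group_key")) += rs.getDouble("value")
    }
    sums.foreach { case (k, v) => println(s"$k -> $v") }
    conn.close()
  }
}
```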
Then I decided to look elsewhere for a more complete solution.
I looked at Pivotal HD + Spark because I am using Scala and the Spark benchmarks are impressive. But I believe the datastore behind it, HDFS, is going to be less efficient than Greenplum for this kind of structured data. (Note the "I believe" — I would be happy to learn I am wrong, but please give some evidence.)
So, to stay with the Greenplum storage layer, I looked at Pivotal's HAWQ, which is essentially SQL on Hadoop built from the Greenplum engine.
With this approach, though, I lose a number of features, mainly the ability to use Spark.
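To make concrete what I would lose: the kind of Spark job I have in mind looks roughly like this, sketched against Spark's generic JDBC data source (connection details, table and column names are made up, and the partitioning options are only there to show a parallel read):

```scala
import org.apache.spark.sql.SparkSession

object SparkOverGreenplum {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("greenplum-aggregation")
      .getOrCreate()

    // Hypothetical table; partitionColumn splits the read across executors
    // so the pull from the database happens in parallel.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://gp-master:5432/mydb")
      .option("dbtable", "big_table")
      .option("user", "user")
      .option("password", "pass")
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "64")
      .load()

    // The aggregation I keep reimplementing by hand, expressed in one line.
    df.groupBy("group_key").sum("value").show()
    spark.stop()
  }
}
```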
Or is it better to just go with the built-in Greenplum features?
So I am at a crossroads and don't know which way is best. I want to process TBs of data that fit the relational model well, and I would like the benefits of Spark and MapReduce.
Am I asking for too much?