
I'm searching for a NoSQL system (preferably open source) that supports analytic functions (AF for short) the way Oracle, SQL Server, and Postgres do. I didn't find any with built-in functions. I've read about Hive, but it doesn't have actual AF features (windows, first/last values, ntiles, lag, lead, and so on), just histograms and n-grams. Also, some NoSQL systems (Redis, for example) support map/reduce, but I'm not sure whether AF can be replaced with it.

I want to make a performance comparison to choose between Postgres and a NoSQL system.

So, in short:

  1. Searching for NoSQL systems with AF
  2. Can I rely on map/reduce to replace AF? Is it fast, reliable, and easy to use?

ps. I tried to make my question more constructive.

ravnur

2 Answers


Once you've really understood how MapReduce works, you can do amazing things with a few lines of code.

Here is a nice video course:

http://code.google.com/intl/fr/edu/submissions/mapreduce-minilecture/listing.html

The real difference in difficulty will be between functions that you can implement with a single MapReduce and those that need chained MapReduces. Moreover, some nice MapReduce implementations (like CouchDB) don't allow you to chain MapReduces (easily).
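To illustrate the single versus chained distinction, here is a minimal sketch in plain Python. It does not use any real MapReduce framework: `map_reduce` is a hypothetical in-memory helper, and the `sales` data is made up. "Total per user" fits in one pass, while "best user per region" needs a second pass that consumes the output of the first.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory MapReduce: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are sorted only to make the output deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Hypothetical input rows: (region, user, amount)
sales = [
    ("eu", "ann", 10), ("eu", "bob", 5), ("eu", "ann", 7),
    ("us", "cid", 3), ("us", "dee", 4), ("us", "cid", 2),
]

# Pass 1: total amount per (region, user). A single MapReduce is enough.
def map_total(rec):
    region, user, amount = rec
    yield (region, user), amount

def reduce_total(key, values):
    region, user = key
    return (region, user, sum(values))

# Pass 2: best user per region. It needs the totals from pass 1 as input.
def map_best(rec):
    region, user, total = rec
    yield region, (total, user)

def reduce_best(region, values):
    total, user = max(values)
    return (region, user, total)

totals = map_reduce(sales, map_total, reduce_total)
best = map_reduce(totals, map_best, reduce_best)  # chained on pass 1 output
print(best)
```

On a system that cannot chain passes, the output of pass 1 would have to be written back and queried again by hand.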

Aurélien Bénel

Some functions need knowledge of all existing data, either because they involve some kind of aggregation (avg, median, standard deviation) or some ordering (first, last).

If you want a distributed NoSQL solution that supports AF out of the box, the system will need to rely on some centralized indexing and metadata to keep information about the data on all nodes, which means having a master node and probably a single point of failure.

You have to ask what you expect to accomplish using NoSQL. Do you want schemaless tables? Distributed data? Better raw performance for very simple queries?

Depending on your needs, I see three main alternatives here:

1 - use a distributed NoSQL system with no single point of failure (e.g. Cassandra) to store your data, and use map/reduce to process the data and produce the results for the desired function (almost every major NoSQL solution supports Hadoop). The caveat is that map/reduce queries are not real-time (a query can take minutes or hours to execute) and require extra setup and learning.

2 - use a traditional RDBMS that supports multiple servers, like MySQL Cluster

3 - use a NoSQL system with a master/slave topology that supports ad-hoc and aggregation queries, like MongoDB

As for the second question: yes, you can rely on M/R to replace AF. You can do almost anything with M/R.
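As a rough illustration of how an analytic function maps onto M/R, here is a sketch that emulates `LAG(value) OVER (PARTITION BY key ORDER BY ts)` in plain Python. The function name `lag_by_partition` and the sample rows are hypothetical, and the grouping step stands in for the shuffle that a real framework would perform between the map and reduce phases.

```python
from collections import defaultdict

def lag_by_partition(rows):
    """Emulate LAG(value) OVER (PARTITION BY key ORDER BY ts), M/R style."""
    # Map phase: emit (partition key, (ordering key, value)).
    groups = defaultdict(list)
    for key, ts, value in rows:
        groups[key].append((ts, value))

    # Reduce phase: each reducer sees one whole partition, sorts it,
    # and pairs every row with the previous row's value (None for the first).
    out = []
    for key in sorted(groups):
        previous = None
        for ts, value in sorted(groups[key]):
            out.append((key, ts, value, previous))
            previous = value
    return out

# Hypothetical input rows: (key, ts, value)
rows = [("a", 2, 20), ("b", 1, 5), ("a", 1, 10), ("a", 3, 35), ("b", 2, 8)]
result = lag_by_partition(rows)
for row in result:
    print(row)
```

Note that the reducer has to hold and sort an entire partition, so a very large partition ends up on a single node. This is one reason the M/R version may be slower than a built-in window function.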

lstern
  • You can indeed compute an average on a distributed architecture, but to do this you need to store the average along with the count. – Aurélien Bénel Nov 09 '12 at 09:46
  • @Istern, yes, you are right. I'm more interested in whether I can rely on map/reduce to replace built-in analytic functions (I mention it in my second point). – ravnur Nov 09 '12 at 15:40
  • @ravnur edited the answer to point that out in a more explicit way (yes, you can rely on MR) – lstern Nov 09 '12 at 15:46
  • @ravnur you are welcome. I recommend that you take a closer look at the map/reduce mechanics. It may be a little strange at first, but the concept becomes really simple once you get it. – lstern Nov 09 '12 at 15:54
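
The point made in the first comment about averages can be sketched as follows. This is a minimal Python illustration with made-up shard data. It carries (sum, count) pairs, which is equivalent to storing the average along with the count, because partial results of that shape can be merged in any order. A plain average of per-shard averages gives a different, wrong answer when shards differ in size.

```python
from functools import reduce

def partial(values):
    """Per-node partial result: (sum, count)."""
    return (sum(values), len(values))

def merge(a, b):
    """Combine two partials; associative, so merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical data spread over three nodes of unequal size.
shards = [[1, 2, 3], [10], [4, 5]]

partials = [partial(s) for s in shards]
total, count = reduce(merge, partials)
mean = total / count  # correct global mean

# Naive approach: averaging the per-shard averages ignores shard sizes.
naive = sum(sum(s) / len(s) for s in shards) / len(shards)

print(mean, naive)
```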