5

I've been reading a bit lately on document-based databases vs. key-value stores (a good overview: "Difference between Document-based and Key/Value-based databases?") and I'm having trouble finding good info on the following.

If we query either of these with the key (or an additional index), there's no real difference in the mechanics - get the value. I'm not clear on how a document store is that different from a key-value store when querying non-indexed documents/fields. If I were to implement a document store on top of a key-value store, I'd do a 'table scan' (check all key/value pairs) for the appropriate value in the query - do document stores do more than this under the covers? Is it appropriate to think of document data stores in this fashion?
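To make the "table scan" idea concrete, here is a minimal sketch of a document-style query layered on a flat key-value store. The store contents, the JSON encoding, and the function names `get` and `scan` are assumptions for illustration, not any real engine's API.

```python
import json

# Hypothetical flat key-value store: each value is an opaque serialized blob.
kv_store = {
    "user:1": json.dumps({"name": "Ann", "city": "Oslo"}),
    "user:2": json.dumps({"name": "Bob", "city": "Lima"}),
    "user:3": json.dumps({"name": "Cy", "city": "Oslo"}),
}

def get(key):
    """Key lookup: one hash probe, and only that value is parsed."""
    return json.loads(kv_store[key])

def scan(field, wanted):
    """Non-indexed query: deserialize every value and test the field."""
    matches = []
    for key, raw in kv_store.items():
        doc = json.loads(raw)
        if doc.get(field) == wanted:
            matches.append(key)
    return matches

oslo_keys = scan("city", "Oslo")
```

The cost of `scan` grows with the number of stored pairs, while `get` does not, which is the gap the question is asking about.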

This is less of a practical question (if I needed to do something useful, I'd most likely use Mongo over BDB) than one aimed at understanding the underlying technology. I'm interested in the scaling aspects of particular systems only if they are applicable to the underlying implementation.

dfb

3 Answers

1

MongoDB and CouchDB store data as JSON documents (MongoDB in its binary form, BSON (spec)). They have optimized algorithms for querying on a particular value inside an object and, as far as I know, they use B-trees (not binary trees) for their indexes (MongoDB certainly does). With an index on the queried field, they can locate the data far faster than scanning through the values in a key-value store.
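A rough sketch of why an index on a field helps: instead of inspecting every document, the engine searches a sorted structure keyed on the field value. Here a sorted list with binary search stands in for the tree index a real engine maintains; the data and the names `build_index` and `find_equal` are hypothetical.

```python
import bisect

# Hypothetical collection: document id -> document.
documents = {
    1: {"name": "Ann", "age": 31},
    2: {"name": "Bob", "age": 25},
    3: {"name": "Cy", "age": 31},
    4: {"name": "Di", "age": 40},
}

def build_index(docs, field):
    """Return a sorted list of (field value, doc id) pairs."""
    return sorted(
        (doc[field], doc_id) for doc_id, doc in docs.items() if field in doc
    )

def find_equal(index, value):
    """Binary-search for the first entry with this value, then walk forward."""
    pos = bisect.bisect_left(index, (value,))
    ids = []
    while pos < len(index) and index[pos][0] == value:
        ids.append(index[pos][1])
        pos += 1
    return ids

age_index = build_index(documents, "age")
ids_aged_31 = find_equal(age_index, 31)
```

The lookup touches a logarithmic number of index entries plus the matches, whereas a query on a field without such an index still has to visit every document.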

(Among the key-value database implementations, Redis has a very interesting way of increasing performance: it keeps the data in memory and does little disk I/O.)

Edit:

I came across a great video that explains the internals of MongoDB. Check it out.

Ege Özcan
  • Small detail: I believe MongoDB uses BSON for storage, not JSON. – nafisto Jul 11 '11 at 16:16
  • Oh, you're certainly right about that, as it allows you to store binary data too. Editing... – Ege Özcan Jul 11 '11 at 16:18
  • Also, do you know whether either of those also have optimized algorithms for NON-INDEXED fields? I believe the OP is wondering whether there are optimized algorithms in a doc store that are significantly better than just retrieving every key-value in a flat store and parsing the values to match the query. – nafisto Jul 11 '11 at 16:19
  • AFAIK, they merge DB operations in memory and take advantage of solid-state disks' ability to retrieve old blocks while appending data to a file. I can also guess at some whitespace alignment in more primitive systems that work with a single model, but I have no references and therefore haven't included these in my answer. – Ege Özcan Jul 11 '11 at 16:25
0

All of them use B-tree and hash indexes to speed up certain queries. A key-value store basically just accesses the key, which, depending on the engine, might be treated as a single value (allowing selection and range queries) or as a composite.
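As an illustration of ordered and composite keys, the sketch below keeps keys sorted so that a range of keys can be read without visiting the rest. The `OrderedKV` class is hypothetical, and its sorted list is only a stand-in for a B-tree.

```python
import bisect

class OrderedKV:
    """Toy ordered key-value store; a sorted list stands in for a B-tree."""

    def __init__(self):
        self._keys = []      # keys kept in sorted order
        self._values = {}    # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]

    def range(self, low, high):
        """Return (key, value) pairs with low <= key < high, in key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_left(self._keys, high)
        return [(k, self._values[k]) for k in self._keys[start:end]]

store = OrderedKV()
store.put(("eu", 2), "b")   # composite key: (region, id)
store.put(("us", 1), "c")
store.put(("eu", 1), "a")

# Prefix scan over the composite key: everything in region "eu".
eu_rows = store.range(("eu",), ("eu", float("inf")))
```

A pure hash index supports only the `get` path; the range and prefix scans depend on the keys being ordered.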

Document-based engines add support for element paths within the document (or whatever they conceptually call it). You can emulate a key-value store by wrapping each pair in a document {key, value}. If you only ever query for documents by the key, you get basically the same result and similar lookup optimizations.
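A small sketch of both ideas, assuming a dotted-path syntax and an in-memory list as the collection (the `get_path` and `find` helpers are hypothetical, not a real engine's API):

```python
def get_path(doc, path):
    """Follow a dotted path such as 'value.address.city'; None if absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

# Key-value pairs emulated as {key, value} documents.
collection = [
    {"key": "user:1", "value": {"name": "Ann", "address": {"city": "Oslo"}}},
    {"key": "user:2", "value": {"name": "Bob", "address": {"city": "Lima"}}},
]

def find(path, wanted):
    """Return every document whose element at `path` equals `wanted`."""
    return [doc for doc in collection if get_path(doc, path) == wanted]

by_key = find("key", "user:2")                   # behaves like a key lookup
by_city = find("value.address.city", "Oslo")     # reaches inside the value
```

Querying on `key` alone reproduces key-value behavior; the path query is what a plain key-value store cannot express without parsing values itself.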

To find information about MongoDB's internals, you can search their site for "internals" (https://www.mongodb.com/search?search=internals). Plenty of information can be found there.

Martin Kersten
-3

An interest in scalability means you have to consider the usage scenario carefully in the design. A scalable NoSQL deployment involves multiple variables that apply whether the underlying implementation is key-based or document-oriented. Here's a short list:

Aspects to take into account:

-Frequency of write vs read ops

-Need for data analysis

-Data redundancy for high availability

-Data replication / synchronization

-Need for a lot of transient data

-Data size

-Cloud-ready

Some NoSQL implementations handle particular aspects on this list better than others.

Scenarios:

-Frequently-written, rarely read data like web hit counters, or data from logging devices: Redis | MongoDB

-Frequently-read, rarely written/updated: Memcached for transient data caching, Cassandra | HBase for searching, and Hadoop and Hive for data analysis

-High-availability applications which demand minimal downtime do well with clustered, redundant data stores: Riak | Cassandra

-Data synchronization across multiple locations: CouchDB

-Transient data (web sessions & caches) do well in transient key-value data stores: Memcached

-Big data arising from business or web analytics that may not follow any apparent schema: Hadoop

Conclusion:

IMHO you should approach the problem of choosing a scalable data store starting from the usage scenario rather than from the underlying implementation differences between them.

I also recommend checking out Couchbase, which is a nice combination of the two worlds: key-based and document-oriented.

  • Sorry, but this doesn't answer my question at all - I'm not choosing a datastore, the question was specifically about implementation and I don't really care about scalability. – dfb Jul 08 '11 at 22:16