Scaling in nosql vs rdbms?

Question

I am trying to understand the architectural difference in nosql and relational databases in the context of scalability.

My understanding of scalability(horizontal) is that as our data grows, we add more and more server to split the load evenly.

In key-value NO-SQL databases, we can add the new machine and split the keys. However, all of examples I have seen so far to understand eventual consistency in NO-SQL databases, they all have master-slave configuration where data is replicated across all the slaves instead of splitting across various machine to achieve scalability.

My question is doesn't using replicating your whole data defeat the whole point of scalability in No-SQL databases? The same can be done in RDBMS as well, with one master(for write) and slaves(for reads), how is NO-SQL more scalable in this regard?

Possible duplicate here: http://stackoverflow.com/questions/8729779/why-nosql-is-better-at-scaling-out-than-rdbms although they only scratch the surface (mostly in favor of nosql) — Simon Mourier, Dec 30 '15 at 14:43

saljuama · Answer 1 · 2015-12-31T21:40:32.563

The goal of scalability is to increase the overall capacity of a given application, and can can be vertical (bigger machines) or horizontal (add more machines). When it comes to horizonal scaling, you can add more machines, but as the number of machines increase, so does the probability of a failure of a node in the cluster, which is something to keep in mind.

When you add more nodes, what you can do is either split the data, which is called sharding, and you can also duplicate the data, which is called replication.

Replication

With replication, the usual architecture is master-slave, where you can only write to the master, who replicates the data to the slaves, so this means that you cannot use replication to split the writes to the cluster, but it is possible to split the reads, deppending on the consistency level (not all of the NoSQL technologies provide the same level) and the cluster configuration.

Sharding

Sharding is more suited to provide scaling, as you split the data set in multiple parts with similar sizes if possible. This clearly allows the benefit to split reads and writes to different nodes. In order to make it work, some mechanisms need to be in place:

routing: to locate in which shard is the data is stored, or to decide in which shard to write
balancing: to maintain the sizes of the different fragments of the dataset with a similar size over time.

But usually these mechanisms are provided by the database vendor, so no need to worry about providing it, but is still necessary to understand to manage the cluster.

The problem here is, as I mentioned at first, the more nodes a cluster has, the higher is the chance to have a failure on a given node, meaning that if a node with a part of the data set goes offline, part of the data won't be available, which is not a desirable scenario. But luckyly, sharding and replication are not exclusive, is it possible to build a sharded cluster, where each shard is a cluster on with replication in place.

But in order to answer your questions

doesn't using replicating your whole data defeat the whole point of scalability in No-SQL databases?

In master-slave architectures, you cannot split the writes, but you can split the reads, which is somewhat a way to scale, although, the main purpose is high availability.

Anyways, there are new emergent databases that start to provide multi-master architecture, where all the nodes act as master, meaning all can receive both, writes and reads.

The same can be done in RDBMS as well, with one master(for write) and slaves(for reads), how is NO-SQL more scalable in this regard?

In a single node environment, NoSQL is already faster than RDBMS when there are JOINS involved, as it is an expensive operation, or when there are a lot of integrity checks involved.

So, when you try to shard the dataset in a RDBMS, unless really carefully designed, the most likely scenario will be the desired data located in different shards. This means that the JOIN and the integrity checks need to be performed between different nodes, making them even more expensive operations than they already are.

This means that RDBMS databases, use mechanisms that act as contraints when you intent to scale horizontally, which NoSQL doesn't. Yes, you can still scale RDBMS horizontally, but overall will be more expensive than using NoSQL databases.

Update: special mention for graph databases

Sharding in graph databases is really difficult, since mathematically, the problem of distributing a large graph between different servers is NP complete. And also, when having to query data among different shards, one of the main features of graphs is lost, fast transverses.

I've seen 2 main approaches that graph databases follows to scale horizontally:

1) Let the application/developer decide how to partition the graph, which you can imagine how complex this can be.

2) Replicate all the graph in all the nodes and use cache-sharding, which means, that all the nodes have the entire data set, but each node maintains in memory the part of the graph that is most queried for that node in particular.

I guess, that in the future, graph database companies will develop more solutions to address this issue.

Related to your question, at their current state, graph databases can still outscale RDBMS when it comes to horizontal scaling, due to the lack of the RDBMS contraints, but its hard to compare between different NoSQL database types.

score 3 · Answer 2 · answered Dec 30 '15 at 18:37

3

the master-slave configuration in NoSQL databases is for High-Avaibility purposes and data consistency, not to confound with the purpose of scalability wich is to load balance the workload.

answered Dec 30 '15 at 18:37

Soufiaane

1,737
6
24
45

1

So if replication across multiple server assists with high availability, how do we achieve scalability/load balancing? – Max Dec 31 '15 at 03:48

user31986 · Answer 3 · 2015-12-30T19:08:50.923

In NoSQL, so far as splitting the keys is concerned, only the Master copies matter. Slaves are for HA and Availability in general. This replication, in fact, is responsible for the eventual consistency -- you will get the data right away, may be it is not the most updated one, but eventually you will have the updated data.

On the other hand, RDBMS will have slower data access/modification because it has to follow ACID properties, and mostly that is with strong-consistency.

Replication is not the differentiating factor, as such, between NoSQL and RDBMS, adherence/non-adherence to ACID properties is. Nor does Scalability mean absence of extra copies. Hope that answers.

score 2 · Answer 4 · answered Dec 30 '15 at 19:12

2

To answer your question, replicating your data does not defeat the point of scalability.

Scalability roughly refers to the ability to grow a database and is not necessarily tied to having more copies of the database.

More servers with the database information on them allows for more readily available access to it to more users, as the other answers have stated.

I believe this might be a misunderstanding between Scalability and Availability?

answered Dec 30 '15 at 19:12

dak1220

306
4
13

1

Ok, I understand more copies == high availability so that if one goes down, we still have others, how do we achieve high scalability in rdbms and no-sql? – Max Dec 31 '15 at 03:51
I am currently a student and don't have much experience implementing these systems, I have only been taught the theory. I have found another stack exchange post which might help you more than I could. The answers in this thread are much more detailed and well structured than what I could muster: http://programmers.stackexchange.com/questions/194340/why-are-nosql-databases-more-scalable-than-sql – dak1220 Dec 31 '15 at 13:23

score 0 · Answer 5 · answered Jan 17 '16 at 17:55

If we just consider key-value databases vs SQL databases then the former ones are more suitable for scalability than the latter ones.

This is because a key-value store doesn't have transactions. So your only guarantee is that you can atomically change the one value for the one key. Which results in ease of scaling.

For example you just hash a key and store a key-value pairs on a machine corresponding to the hash of the key.

You can't do the same thing to the SQL database without loosing an ACID (atomicity, consistency, isolation, durability of a transaction) property. Moreover you can't even easily perform a join SELECT if you store different tables or different parts of a table on different machines.

So in general NoSQL databases are more prepared for sharding across many machines than SQL ones.

Scaling in nosql vs rdbms?

5 Answers5

Replication

Sharding