What Keeps Relational Databases From Horizontal Scaling?

Question

When I researched horizontal scaling for relational databases on the internet, I got the impression that the only option which includes write scaling as well as read scaling is sharding, which seems to be a manual design process that involves complex application specific configurations and is hard to maintain if you need to change your sharding structure.

On the other hand, NoSQL seems to be natively supporting horizontal scaling but it has the drawback of not supporting transactions, ACID etc.

One other concept that seems to have been popular recently is NewSQL databases. And these databases promise to hit the sweet spot by being both ACID compliant and able to horizontally scale, either by automatic sharding or some other innovative architecture.

My question is, if we are using SAN with our relational database, isn't adding more database servers to the cluster and more disks to the SAN going to achieve horizontal scaling? (Adding disks will increase total disk IOPS and throughput as well as disk space.) What will be the bottleneck there so that we need to use a NewSQL database to achieve both ACID and horizontal scaling?

score 6 · Answer 1 · edited Jun 20 '20 at 09:12

6

Horizontal scaling in relational databases is hard to achieve because when you have tables (or shards of the same table) across the different cluster nodes, joins usually become very inefficient. Additionally, there is a problem of replication and keeping ACID guarantees while ensuring that all replicas have fresh data. However, there is a RDBMS that scales horizontally - MySQL Cluster. From the docs:

MySQL Cluster automatically shards (partitions) tables across nodes, enabling databases to scale horizontally on low cost..

Auto-Sharding in MySQL Cluster

Unlike other sharded databases, users do not lose the ability to perform JOIN operations, sacrifice ACID-guarantees or referential integrity (Foreign Keys) when performing queries and transactions across shards.

In my company, We have been using MySQL Cluster for quite some time and it really works well (and scales horizontally). There is also Citus (recently released) that is built on the top of PostgreSQL, but haven't tried it personally.

edited Jun 20 '20 at 09:12

Community

1
1

answered Feb 16 '18 at 12:03

Miljen Mikic

14,765
8
58
66

2

You do lose availability though https://planet.mysql.com/entry/?id=32245 A scaledg MySQL cluster means one node out means whole cluster goes down. Which is why other RDBMS scale up not out – gbn Feb 16 '18 at 12:14
2

@gbn There is always a choice between consistency and availability due to CAP theorem, as correctly stated in your answer. However, it is **not** true that a failure of a single node in MySQL Cluster implies that whole cluster goes down. A cluster of 4 nodes can handle up to 2 failed nodes. See also: https://www.mysql.com/products/cluster/availability.html – Miljen Mikic Feb 16 '18 at 12:27
@MiljenMikic What I wonder particularly is that, when SAN is scaled out by adding new disks (without any sharding or data partitioning), database server instances will actually see a single scaled up data storage, aren't they? Because SAN will transparently represent all those physical disks as a single virtual storage. So, I assume it will be equivalent to a local node being scaled up by more disk space and throughput. That's why I don't quite understand where the bottleneck will be. Is there a flaw in my assumption? – John L. Feb 16 '18 at 13:11
@JohnL. It will be equivalent to a local node being scaled, but just adding new disks is not horizontal scaling. In your question you say "..sharding, which seems to be a manual design process that involves complex application specific configurations", but I provided you an example of RDBMS that does auto sharding and hopefully, answered the question from the title. – Miljen Mikic Feb 16 '18 at 13:24

score 2 · Answer 2 · answered Feb 16 '18 at 11:48

The answer is "CAP Theorem"

You can have at most 2 of Consistency, Availability or Partition Tolerance but typically it boils down to

(Consistency OR availability) AND Partition Tolerance

Database systems designed with traditional ACID guarantees in mind such as RDBMS choose consistency over availability, whereas systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.[6]

With NoSQL if a node drops out the system stays up, but you may not get the latest data. This of course is a huge no-no in, say, banking or billing systems. But in a Social Media application it is of no consequence.

More examples

From this site

CAP theorem - Availability and Partition Tolerance

What I wonder particularly is that, when SAN is scaled out by adding new disks (without any sharding or data partitioning), database server instances will actually see a single scaled up data storage, aren't they? Because SAN will transparently represent all those physical disks as a single virtual storage. So, I assume it will be equivalent to a local node being scaled up by more disk space and throughput. That's why I don't quite understand where the bottleneck will be. Is there a flaw in my assumption? — John L., Feb 16 '18 at 13:11
@JohnL. Please edit clarifications into your post, not comments. — philipxy, Feb 17 '18 at 18:01

What Keeps Relational Databases From Horizontal Scaling?

2 Answers2