35

Recently I read some articles online indicating that relational databases have scaling issues and are not a good fit for big data, especially in cloud computing where the data is large. But I could not find good, solid reasons for why they don't scale well by googling. Can you please explain to me the limitations of relational databases when it comes to scalability?

Thanks.

lahiru madhumal
  • 6
    Define "not scalable". Plenty of Fish and Stack Overflow use relational databases and they get millions of hits _per day_. – Oded Aug 31 '12 at 12:05
  • 6
    My point with the above is that many people who say that relational databases don't scale are the same ones who don't know how to use them effectively. – Oded Aug 31 '12 at 12:05
  • 2
    @Oded Yes. I see you've got a point. Sites like Stack Overflow get millions of hits per day and clearly relational databases have the ability to handle it. But what I am trying to clarify is whether the problem here is efficiency, or maybe cost, etc. That is what I want to know. I am just trying to keep an open mind ;) – lahiru madhumal Aug 31 '12 at 12:19
  • 1
    To address your question: relational databases support [ACID](http://en.wikipedia.org/wiki/ACID) - this is expensive. Many NoSQL solutions remove one or more of the constraints ACID imposes (so, a trade-off between ACID and speed). – Oded Aug 31 '12 at 12:21
  • Possible duplicate of [Why NoSQL is better at "scaling out" than RDBMS?](https://stackoverflow.com/questions/8729779/why-nosql-is-better-at-scaling-out-than-rdbms) – philipxy Jul 17 '19 at 19:50

3 Answers

30

Imagine two different kinds of crossroads.

One has traffic lights or police officers regulating traffic, motion on the crossroad is at limited speed, and there's a watchdog registering precisely which car drove onto the crossroad, at exactly what time, and in which direction it went.

The other has none of that: everyone who arrives at the crossroad, at whatever speed he's driving, just dives in and wants to get through as quickly as possible.

The former is any traditional database engine. The crossroad is the data itself. The cars are the transactions that want to access the data. The traffic lights or police officer is the DBMS. The watchdog keeps the logs and journals.

The latter is a NOACID type of engine.

Both have a saturation point, at which point arriving cars are forced to start queueing up at the entry points. Both have a maximal throughput. That threshold lies at a lower value for the former type of crossroad, and the reason should be obvious.

The advantage of the former type of crossroad should, however, also be obvious: far less opportunity for accidents to happen. On the second type of crossroad, you can expect accidents not to happen only if traffic density is at a much, much lower point than the theoretical maximal throughput of the crossroad. Translated to data management engines, this means a guarantee of consistent and coherent results, which only the former type of crossroad (the classical database engine, whether relational or networked or hierarchical) can deliver.

The analogy can be stretched further. Imagine what happens if an accident DOES happen. On the second type of crossroad, the primary concern will probably be to clear the road as quickly as possible so traffic can resume, and when that is done, what info is still available to investigate who caused the accident and how? Nothing at all. It won't be known. The crossroad is open, just waiting for the next accident to happen. On the regulated crossroad, there's the police officer who regulated the traffic, who saw what happened and can testify. There are the logs saying which car entered, at precisely what time, at precisely which entry point, at precisely what speed - a lot of material is available for inspection to determine the root cause of the accident. But of course none of that comes for free.

Colourful enough as an explanation?

Erwin Smout
  • 11
    On the unregulated road, you handle more traffic by just increasing the width of the road. On the regulated road, you have to get a new police officer, new traffic lights, cameras, etc. And now the complicated part: the two police officers and sets of traffic lights must work in coordination. – joshua Feb 10 '13 at 14:44
  • 2
    Is this answer explaining the scalability issue? Which part did I miss? – user2652379 Dec 28 '18 at 02:04
19

Relational databases provide solid, mature services according to the ACID properties. We get transaction handling, efficient logging to enable recovery, etc. These are core services of relational databases, and the ones that they are good at. They are hard to customize, and might be considered a bottleneck, especially if you don't need them in a given application (e.g. serving website content of low importance; in this case, for example, the widely used MySQL does not provide transaction handling with its default storage engine, and therefore does not satisfy ACID). Lots of "big data" problems don't require these strict constraints - for example web analytics, web search, or processing moving-object trajectories - as they already include uncertainty by nature.
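To make the ACID point concrete, here is a minimal sketch using Python's standard-library `sqlite3` module (the table and amounts are invented for illustration): a transfer that would violate an invariant is rolled back atomically, so the data never reflects a half-applied update.

```python
import sqlite3

# In-memory database; sqlite3 gives us ACID transactions out of the box.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    # Transfer 150 from account 1 -- more than it holds.
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
    # Enforce the invariant ourselves and abort if it is violated.
    (negative,) = conn.execute(
        "SELECT COUNT(*) FROM accounts WHERE balance < 0").fetchone()
    if negative:
        raise ValueError("overdraft")
    conn.commit()
except ValueError:
    conn.rollback()  # atomicity: neither UPDATE survives

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 100, 2: 0} -- the failed transfer left no trace
```

This guarantee is exactly what costs locking and logging effort; engines that relax it can skip that work.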

When you reach the limits of a given computer (memory, CPU, disk: the data is too big, or data processing is too complex and costly), distributing the service is a good idea. Lots of relational and NoSQL databases offer distributed storage. In this case, however, ACID turns out to be difficult to satisfy: the CAP theorem states something similar, namely that availability, consistency and partition tolerance cannot all be achieved at the same time. If we give up ACID (satisfying BASE for example), scalability might be increased. See this post, e.g., for a categorization of storage methods according to CAP.
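The BASE trade-off can be sketched with a toy model in Python (all names here are invented, and a global counter stands in for real timestamps): two replicas accept concurrent writes without any coordination, temporarily disagree, and later converge via last-write-wins reconciliation.

```python
import itertools

_clock = itertools.count()  # deterministic stand-in for a timestamp source

class Replica:
    """A toy replica: maps key -> (timestamp, value), no locking at all."""
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = (next(_clock), value)

    def read(self, key):
        return self.data[key][1]

def anti_entropy(a, b):
    # Merge the two replicas: for each key, the newer write wins on both.
    for key in set(a.data) | set(b.data):
        newer = max((r.data[key] for r in (a, b) if key in r.data),
                    key=lambda tv: tv[0])
        a.data[key] = b.data[key] = newer

r1, r2 = Replica(), Replica()
r1.write("x", "old")
r2.write("x", "new")   # concurrent write accepted on the other replica
# Until the replicas sync, readers can see different values (availability
# is kept, consistency is not)...
anti_entropy(r1, r2)
# ...but after reconciliation both converge on the later write.
print(r1.read("x"), r2.read("x"))  # new new
```

An ACID engine would instead block or abort one of the two writes up front - that coordination is precisely what limits its distributed throughput.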

Another bottleneck might be the flexible and cleverly typed relational model itself, with its SQL operations: in lots of cases a simpler model with simpler operations would be sufficient and more efficient (like untyped key-value stores). The common row-wise physical storage model might also be limiting: for example, it isn't optimal for data compression.
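One reason the key-value model scales out so easily can be shown in a few lines of Python (node names are made up for illustration): because every get/put touches exactly one key - no joins, no multi-row transactions - routing is a pure function of the key, and clients need no central coordinator.

```python
import hashlib

# Hypothetical shard nodes; real systems usually use consistent hashing so
# that adding a node doesn't reshuffle every key.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key: str) -> str:
    """Route a key to a shard: hash it and pick a node deterministically."""
    digest = hashlib.sha1(key.encode()).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]

# Every client computes the same routing independently.
print(node_for("user:42") == node_for("user:42"))  # True
```

A SQL query joining two tables, by contrast, may need rows from several shards at once, which is where distributed relational execution gets expensive.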

There are however fast and scalable ACID compliant relational databases, including new ones like VoltDB, as the technology of relational databases is mature, well-researched and widespread. We just have to select an appropriate solution for the given problem.

user6039980
csaba
  • 2
    "These cannot be turned off". That is a blatant lie. DB2 allows you to turn journaling (logging) off (and I'd be surprised if any of the other big dogs lacked the equivalent in their products). And guess what: if you do that, your update programs may run as much as twice as fast. Of course the price you pay is taking backups before such an update run, and the time it takes to restore if the program fails. Of course this typically isn't done. – Erwin Smout Sep 04 '12 at 11:23
  • 1
    Yes, "cannot" might be too strong here. I don't know all DBs; however, for example, use of Oracle's nologging clause only reduces log size, but does not turn logging off. Transaction handling and writing undo information definitely cannot be turned off, or, if turned off, the DB is not ACID compliant any more. Am I wrong? And one more bottleneck: the data model and SQL. A flexible model with clever algorithms; in lots of cases a simpler model with simpler operations would be sufficient and more efficient (like untyped key-value stores). – csaba Sep 04 '12 at 22:42
2

Take the simplest example: inserting a row with a generated ID. Since IDs must be unique within a table, the database must somehow lock some sort of persistent counter so that no other INSERT uses the same value. So you have two choices: either allow only one instance to write data, or use a distributed lock. Both solutions are a major bottleneck - and this is the simplest example!
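One common workaround, sketched below in Python (purely illustrative), is to give each instance its own interleaved sequence - instance i of N hands out i, i+N, i+2N, and so on - so uniqueness requires no shared counter and no distributed lock after setup.

```python
def id_generator(instance_id, num_instances):
    """Yield IDs that can never collide with another instance's IDs."""
    n = instance_id
    while True:
        yield n
        n += num_instances

gen_a = id_generator(0, 2)   # instance 0 of 2
gen_b = id_generator(1, 2)   # instance 1 of 2
a = [next(gen_a) for _ in range(3)]   # [0, 2, 4]
b = [next(gen_b) for _ in range(3)]   # [1, 3, 5]
print(a, b)
```

The trade-off is that IDs are no longer globally ordered by insertion time; schemes like Instagram's sharded IDs (linked in a comment below) embed a timestamp in the high bits to get both properties.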

Tomasz Nurkiewicz
  • 2
    Interesting [read](http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram) on how Instagram tackles the ID generation problem – Kermit Sep 02 '12 at 23:12
  • 2
    @Tomasz, ... or just use different sets of identifiers for different instances (e.g. with a distinct prefix code or different ranges of values). This really is not a hard problem in a relational database! – nvogel Sep 03 '12 at 19:06
  • @Tomasz Nurkiewicz. I just want to know how NoSQL can handle this issue. Is its data model able to do this? – nathan Dec 20 '16 at 19:14