Distributed SQL Database

Question

I need to decide which database to use for a system where I need AP from CAP theorem. Data I constantly but slowly going in. Big queries are expected. It should be reliable - no single point of failure. I can use up to 3 instances on different nodes. In-memory solutions are bad for me because of data size - it should be running for years and I expect up to terabyte data sizes. Most guys in my team prefer SQL. But I understand that traditional SQL databases are not fault tolerant in terms of hardware failure. Any ideas?

https://www.getyarn.io/yarn-clip/ac5b054e-4f2a-4e15-a810-094a2553fe44 — John Boker, Dec 19 '16 at 14:46
Pretty complex in configuration and different nodes with different roles. Maybe some other solution? — Pavlo, Dec 19 '16 at 18:03
With your requirements maybe your developers who prefer SQL need to learn something new. — John Boker, Dec 19 '16 at 18:13
Strongly agree but the task is very well mapped on relational database since there will be multiple join statements. — Pavlo, Dec 19 '16 at 18:16
There are list of nodes for each node there is historical data which is represented as a list of requests and each request has associated list of parameters and their values. — Pavlo, Dec 19 '16 at 18:17
Nodes can be added any time, they are pulled for info using schedule or by user request. — Pavlo, Dec 19 '16 at 18:18
AWS's Redshift is a great system. It uses a PostgreSQL language and is very fast and can handle terabytes of data no problem. Cassandra also has a SQL-like language, but it's not a standard implementation and has a bit of a learning curve. If you need a cheaper solution, I'd recommend storing flat files Parquet (or similar) format in a DFS and using Spark or Hive to query them. — kuujo, Dec 20 '16 at 20:53
We think we could also consider Cockroach. It is still in beta but looks promising. Another solution that we consider is to switch from SQL and in this case use rethinkdb. — Pavlo, Dec 22 '16 at 07:40

score 3 · Answer 1 · answered Oct 05 '20 at 20:05

3

Since this question was asked there have been some significant changes in the Distributed SQL or NewSQL landscape...the most noteworthy being the viability of CockroachDB. That appears to be the best option in a situation like the one referenced in this question.

No single points of failure. Easy to scale. Can handle tons of volume. You can run it wherever you want. Speaks postgres. Super fault tolerant.

answered Oct 05 '20 at 20:05

Daniel Kelly

31
2

Hi Daniel, thanks for that answer. This topic also has many related answers on SO. I recommend that you also add a link to one of the more popular answers, here: https://stackoverflow.com/q/1145726/212950 – MAbraham1 Oct 05 '20 at 21:14

score 0 · Accepted Answer · answered Jan 05 '17 at 07:21

0

Amazon Redshift seems to be the best answer(thank you kuujo). But we will try rethinkdb because it has some nice feature

answered Jan 05 '17 at 07:21

Pavlo

1,594
2
15
30

Distributed SQL Database

2 Answers2