0

I need to decide which database to use for a system where I need AP from CAP theorem. Data I constantly but slowly going in. Big queries are expected. It should be reliable - no single point of failure. I can use up to 3 instances on different nodes. In-memory solutions are bad for me because of data size - it should be running for years and I expect up to terabyte data sizes. Most guys in my team prefer SQL. But I understand that traditional SQL databases are not fault tolerant in terms of hardware failure. Any ideas?

Pavlo
  • 1,594
  • 2
  • 15
  • 30
  • I need horizontal scaling as I understand – Pavlo Dec 19 '16 at 14:44
  • 2
    https://www.getyarn.io/yarn-clip/ac5b054e-4f2a-4e15-a810-094a2553fe44 – John Boker Dec 19 '16 at 14:46
  • http://www.postgres-xl.org/ –  Dec 19 '16 at 14:54
  • Pretty complex in configuration and different nodes with different roles. Maybe some other solution? – Pavlo Dec 19 '16 at 18:03
  • With your requirements maybe your developers who prefer SQL need to learn something new. – John Boker Dec 19 '16 at 18:13
  • Strongly agree but the task is very well mapped on relational database since there will be multiple join statements. – Pavlo Dec 19 '16 at 18:16
  • There are list of nodes for each node there is historical data which is represented as a list of requests and each request has associated list of parameters and their values. – Pavlo Dec 19 '16 at 18:17
  • Nodes can be added any time, they are pulled for info using schedule or by user request. – Pavlo Dec 19 '16 at 18:18
  • AWS's Redshift is a great system. It uses a PostgreSQL language and is very fast and can handle terabytes of data no problem. Cassandra also has a SQL-like language, but it's not a standard implementation and has a bit of a learning curve. If you need a cheaper solution, I'd recommend storing flat files Parquet (or similar) format in a DFS and using Spark or Hive to query them. – kuujo Dec 20 '16 at 20:53
  • We think we could also consider Cockroach. It is still in beta but looks promising. Another solution that we consider is to switch from SQL and in this case use rethinkdb. – Pavlo Dec 22 '16 at 07:40

2 Answers2

3

Since this question was asked there have been some significant changes in the Distributed SQL or NewSQL landscape...the most noteworthy being the viability of CockroachDB. That appears to be the best option in a situation like the one referenced in this question.

No single points of failure. Easy to scale. Can handle tons of volume. You can run it wherever you want. Speaks postgres. Super fault tolerant.

  • Hi Daniel, thanks for that answer. This topic also has many related answers on SO. I recommend that you also add a link to one of the more popular answers, here: https://stackoverflow.com/q/1145726/212950 – MAbraham1 Oct 05 '20 at 21:14
0

Amazon Redshift seems to be the best answer(thank you kuujo). But we will try rethinkdb because it has some nice feature

Pavlo
  • 1,594
  • 2
  • 15
  • 30