0

We have a need to run analytics queries on the data stored in rds. And that's becoming very very slow because of group by queries and ever increasing size of the tables. For example we have following 3 tables in RDS :

alm(id,name,cli, group_id, con_id ...)
group(id, type,timestamp ...)
con(id,ip,port ...)

each of the tables have very high amount of data and are being updated several times a minute as the new data comes in.

Now we want to run aggregation queries like :

select name from alm, group, con where alm.group_id=group.id and alm.con_id=con.id group by name, group.type, con.ip

We also want users to run custom aggregation queries in the future as opposed to the fix query provided by us in future.

So far the options we are considering are moving to either Cassandra, Elasticsearch or Dynamo db so that aggregation would be faster. Can someone guide as to how to go about this problem ? Or any crumbs of experience ? Anybody know any technologies have severe advantage over others ?

voidMainReturn
  • 3,339
  • 6
  • 38
  • 66
  • Relevant question: https://stackoverflow.com/questions/27054954/elasticsearch-vs-cassandra-vs-elasticsearch-with-cassandra – Aaron May 30 '19 at 21:26

3 Answers3

1

Cassandra and DynamoDB are quite different from ElasticSearch. And all three are very different from relational database offerings.

For ad-hoc analytics, relational databases, with a well designed schema, can be pretty good up to the point where you need to split your data across multiple servers (then replication issues start to dominate the benefits). And that's really the primary motivation for non-relational databases. But the catch is that in order to solve the horizontal scaling problem, they generally trade some features such as joining and aggregating.

Elastic search is really great at answering search queries, but not particularly good at aggregations (other than very basic counts, sums and their estimates). It's amazing at indexing copious amounts of data but it can't answer queries that require complex cross index operations. It is also not as robust (rebuilding indexes may be needed from time to time)

If you have high volumes of data and you need aggregation, you pretty much have two options:

  1. if you can get away with offline analytics, then distributed data processing frameworks such as Spark can get you the answers you need very efficiently

  2. if you need online analytics, the most common approach is to pre-compute the aggregations and update as you get more data, so that answers to queries can be very fast without having to process a lot of data for each query

Don't be afraid to mix and match though. Relational databases have their purpose as do non-relationals. There is no silver bullet though.

Mike Dinescu
  • 54,171
  • 16
  • 118
  • 151
  • minor note: elasticsearch can query multiple indices: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multiple-indices.html – Aydin K. Jun 06 '21 at 18:11
  • 1
    Yes, thank you. I've updated the answer. I meant that ES is not designed to do complex joins – Mike Dinescu Jun 06 '21 at 20:10
0

One more options is Column-oriented databases, this kind of DB is more suitable for 'analytics' cases when you have many data fields and you want to perform aggregations or extract some subset of fields for big amount of data.

Recently Yandex ClickHouse becomes very popular and there is Column Oriented service from Amazon - Redshift. Also there are several other solutions

Mikhail Baksheev
  • 1,394
  • 11
  • 13
0

Store in parquet and use spark, partition efficiently

Amareswar
  • 2,048
  • 1
  • 20
  • 36