Apache Kudu vs InfluxDB on time series data for fast analytics

Question

How does Apache Kudu compare with InfluxDB for IoT sensor data that requires fast analytics (e.g. robotics)?

Kudu has recently released v1.0. I have a few specific questions on how Kudu handles the following:

1) Sharding?
2) Data retention policies (keeping data for a specified number of data points or period of time, then aggregating or discarding it)?
3) Is there roll-up/aggregation functionality (e.g. converting 1s interval data into 1min interval data)?
4) Is there support for continuous queries (i.e. materialised views on data, such as a query that shows the last 60 seconds on an ongoing basis)?
5) How is the data stored between disk and memory?
6) Can a regular time series be induced from an irregular one (converting irregular event data into regular time intervals)?

Also, are there any other distinct strengths or weaknesses of Kudu relative to InfluxDB?

Is the shortlist limited to only those two databases? A lot of other implementations could suit the purpose, all the way from plant historians to recently introduced TSDBs. — Sergei Rodionov, Sep 25 '16 at 16:57
I'm looking for something of a full package, so I'm happy to open this question up to other candidates. InfluxDB is quite good from first impressions, but I am not sure how it scales on a single node (unfortunately they made clustering closed source). I looked at OpenTSDB very briefly but noticed I would have to accept the overall complexity of running a Hadoop/HBase cluster, which can get a little messy. — , Sep 25 '16 at 22:56
Take a look also at alternative time series databases such as VictoriaMetrics or TimescaleDB. — valyala, Jan 26 '20 at 14:56

score 4 · Answer 1 · answered Dec 03 '16 at 07:08

Kudu is a much lower-level datastore than InfluxDB. It's more like a distributed file system that provides a few database-like features than a full-fledged database. It currently relies on a query engine such as Impala to query the data stored in Kudu.

Kudu is also fairly young. It would likely be possible to build a time series database with Kudu as the distributed store underneath it, but currently the closest implementation to that would be this proof of concept project.

As for your specific questions:

1) Kudu stores data in tablets and offers two ways of partitioning data: range partitioning and hash partitioning.
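To make the two partitioning styles concrete, here is a minimal Python sketch of how a row could be assigned to a partition. It is a conceptual illustration only and does not use any Kudu API: the function name `partition_for`, the bucket count, the per-day range granularity, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib
from datetime import datetime, timezone
from typing import Tuple

NUM_HASH_BUCKETS = 4  # assumed bucket count, for illustration only


def partition_for(sensor_id: str, ts: datetime) -> Tuple[int, str]:
    """Return (hash_bucket, range_partition) for one row.

    Hashing on sensor_id spreads writes across buckets; the range part
    groups rows by UTC calendar day so a whole day can be handled as a unit.
    CRC32 is a stand-in hash, not what Kudu uses internally.
    """
    bucket = zlib.crc32(sensor_id.encode("utf-8")) % NUM_HASH_BUCKETS
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return bucket, day
```

With this layout, all readings from one sensor on one day land in the same partition, while different sensors are spread across the hash buckets.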

2) No, although if the data were structured with range partitioning, dropping a tablet should be an efficient operation (similar to how InfluxDB drops whole shards when deleting data).
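The idea of retention by dropping whole partitions can be sketched in a few lines of Python. This is an in-memory model, not Kudu code: the `drop_expired` function and the dict of day-keyed partitions are hypothetical names chosen for the example.

```python
from datetime import date, timedelta
from typing import Dict, List


def drop_expired(partitions: Dict[date, List[float]],
                 today: date, keep_days: int) -> List[date]:
    """Remove whole day-partitions older than the retention window.

    Returns the dropped days. No individual rows are scanned: each
    expired partition is discarded as a unit.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = sorted(day for day in partitions if day < cutoff)
    for day in expired:
        del partitions[day]
    return expired
```

The cost depends on the number of expired partitions rather than the number of rows they hold, which is why this tends to be cheaper than row-by-row deletes.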

3) Query engines that work with Kudu, such as Impala or Spark, are able to do this.
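For reference, the roll-up itself is a simple grouping operation; in a SQL engine it would typically be a GROUP BY over a truncated timestamp. A minimal pure-Python sketch of the same logic, assuming samples arrive as `(epoch_seconds, value)` pairs (the function name `rollup_mean` is made up for this example):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rollup_mean(samples: Iterable[Tuple[int, float]],
                bucket_seconds: int = 60) -> List[Tuple[int, float]]:
    """Average (epoch_seconds, value) samples into fixed-width buckets."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for ts, value in samples:
        # Truncate the timestamp down to the start of its bucket.
        bucket_start = ts - ts % bucket_seconds
        sums[bucket_start] += value
        counts[bucket_start] += 1
    return [(start, sums[start] / counts[start]) for start in sorted(sums)]
```

Calling `rollup_mean` with the default `bucket_seconds=60` turns 1s interval data into 1min averages; other aggregates (min, max, last) follow the same pattern.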

4) Impala does have some support for views.

5) Data is stored in a columnar format similar to Parquet. However, Kudu's big selling point is that it allows the columnar data to be mutable, which is very difficult with current Parquet files.

6) While I'm sure you could get Spark or Impala to do this, it's not a built-in feature.
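One common way to do this in application code is to resample onto a fixed grid and carry the last observation forward. The sketch below is plain Python and only illustrates the technique; `regularize` and its parameters are hypothetical, and other fill strategies (interpolation, averaging per interval) are equally valid choices.

```python
from typing import List, Optional, Tuple


def regularize(events: List[Tuple[int, float]], start: int, end: int,
               step: int) -> List[Tuple[int, Optional[float]]]:
    """Resample irregular (epoch_seconds, value) events onto a regular grid.

    Each grid point takes the most recent event at or before it (last
    observation carried forward), or None if no event has occurred yet.
    """
    ordered = sorted(events)
    out: List[Tuple[int, Optional[float]]] = []
    i = 0
    last: Optional[float] = None
    for t in range(start, end + 1, step):
        # Consume every event up to and including this grid point.
        while i < len(ordered) and ordered[i][0] <= t:
            last = ordered[i][1]
            i += 1
        out.append((t, last))
    return out
```

For example, events at seconds 3, 14 and 15 resampled on a 5 second grid yield one value per grid point, with gaps filled by the most recent reading.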

Kudu is still a new project, and it is not really designed to compete with InfluxDB but rather to provide a highly scalable, high-performance storage layer for a service like InfluxDB. The ability to append data to a Parquet-like data structure is really exciting though, as it could eliminate the need for lambda architectures.
