76

According to the Prometheus webpage, one main difference between Prometheus and InfluxDB is the use case: while Prometheus stores only time series, InfluxDB is better geared towards storing individual events. Since some major work was recently done on the storage engine of InfluxDB, I wonder if this is still true.

I want to set up a time series database, and apart from the push/pull model (and probably a difference in performance) I can see nothing major that separates the two projects. Can someone explain the difference in use cases?

SpaceMonkey
  • 965
  • 1
  • 8
  • 12

4 Answers

93

InfluxDB CEO and developer here. The next version of InfluxDB (0.9.5) will have our new storage engine. With that engine we'll be able to efficiently store either single event data or regularly sampled series, i.e. both irregular and regular time series.

InfluxDB supports int64, float64, bool, and string data types using different compression schemes for each one. Prometheus only supports float64.
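For illustration, here is roughly how those four field types look in InfluxDB's 0.9 line protocol (a minimal sketch; the measurement, tag, and field names below are made up):

```python
# Sketch: one point per field type in InfluxDB 0.9 line protocol.
# Integers carry a trailing "i", strings are double-quoted, floats are the
# default numeric type, and booleans are written as true/false.
points = [
    'requests,host=server01 count=42i',      # int64
    'requests,host=server01 latency=0.153',  # float64
    'requests,host=server01 healthy=true',   # bool
    'requests,host=server01 status="ok"',    # string
]
```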

For compression, the 0.9.5 version will be competitive with Prometheus. In some cases we'll see better results since we vary the compression on timestamps based on what we see. The best-case scenario is a regular series sampled at exact intervals. In that case, by default, we can compress the timestamps of 1,000 points down to an 8-byte starting time, a delta (zig-zag encoded), and a count (also zig-zag encoded).
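As a rough sketch of that idea (illustrative only, not InfluxDB's actual code), delta plus zig-zag encoding of a regular timestamp block could look like this:

```python
# Sketch: collapse a block of evenly spaced timestamps to (start, delta, count),
# zig-zag encoding the signed values so small magnitudes encode as small numbers.

def zigzag(n: int) -> int:
    """Map a signed 64-bit value to an unsigned one: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)

def encode_regular_block(timestamps):
    start = timestamps[0]
    delta = timestamps[1] - timestamps[0]
    # Only valid when every gap is identical (the best-case regular series).
    assert all(b - a == delta for a, b in zip(timestamps, timestamps[1:]))
    return start, zigzag(delta), zigzag(len(timestamps))

# 1,000 points sampled every 10 seconds, nanosecond epoch timestamps.
ts = [1_446_000_000_000_000_000 + i * 10_000_000_000 for i in range(1000)]
print(encode_regular_block(ts))
```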

Depending on the shape of the data we've seen < 2.5 bytes per point on average after compactions.

YMMV based on your timestamps, the data type, and the shape of the data. Random floats with nanosecond-scale timestamps and large, variable deltas would be the worst case, for instance.

Variable timestamp precision is another feature InfluxDB has: it can represent second, millisecond, microsecond, or nanosecond scale times. Prometheus is fixed at milliseconds.
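As a hedged example (endpoint and parameter names as of InfluxDB 0.9.x; the host and database below are placeholders), the write endpoint lets you pick the precision per request:

```python
# Sketch: writing the same point with nanosecond vs. second timestamps.
import requests

# Nanosecond epoch timestamp (the default precision).
requests.post('http://localhost:8086/write', params={'db': 'mydb'},
              data='cpu,host=server01 value=0.64 1446145504000000000')

# Second epoch timestamp, declared via the "precision" query parameter.
requests.post('http://localhost:8086/write',
              params={'db': 'mydb', 'precision': 's'},
              data='cpu,host=server01 value=0.64 1446145504')
```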

Another difference is that writes to InfluxDB are durable after a success response is sent to the client. Prometheus buffers writes in memory and by default flushes them every 5 minutes, which opens a window of potential data loss.

Our hope is that once 0.9.5 of InfluxDB is released, it will be a good choice for Prometheus users as long-term metrics storage (in conjunction with Prometheus). I'm pretty sure that support is already in Prometheus, but until the 0.9.5 release drops it might be a bit rocky. Obviously we'll have to work together and do a bunch of testing, but that's what I'm hoping for.

For single server metrics ingest, I would expect Prometheus to have better performance (although we've done no testing here and have no numbers) because of their more constrained data model and because they don't append writes to disk before writing out the index.

The query languages of the two are very different. I'm not sure what they support that we don't yet, or vice versa, so you'd need to dig into the docs on both to see if there's something one can do that you need. Longer term, our goal is to have InfluxDB's query functionality be a superset of Graphite, RRD, Prometheus, and other time series solutions. I say superset because we want to cover those in addition to more analytic functions later on. It'll obviously take us time to get there.
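To make the contrast concrete, here is a hedged sketch of one roughly comparable query in each language, issued over the two HTTP APIs (hostnames, database, and series names are placeholders, and the API paths may differ between versions):

```python
# Sketch: average CPU value over the last hour, asked of each system.
import requests

influxql = "SELECT mean(value) FROM cpu WHERE time > now() - 1h GROUP BY time(5m)"
influx = requests.get('http://localhost:8086/query',
                      params={'db': 'mydb', 'q': influxql})

promql = "avg_over_time(cpu_usage[1h])"
prom = requests.get('http://localhost:9090/api/v1/query', params={'query': promql})
```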

Finally, a longer-term goal for InfluxDB is to support high availability and horizontal scalability through clustering. The current clustering implementation isn't feature complete yet and is only in alpha. However, we're working on it, and it's a core design goal for the project. In our clustering design, data is eventually consistent.

To my knowledge, Prometheus' approach is to use double writes for HA (so there's no eventual consistency guarantee) and to use federation for horizontal scalability. I'm not sure how querying across federated servers would work.

Within an InfluxDB cluster, you can query across the server boundaries without copying all the data over the network. That's because each query is decomposed into a sort of MapReduce job that gets run on the fly.
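A conceptual sketch of that scatter/gather idea (not InfluxDB code), using a distributed mean as the example:

```python
# Each data node computes a partial aggregate over its local shard (the "map"
# step); the coordinating node merges the partials (the "reduce" step), so raw
# points never have to cross the network.

def map_partial_mean(local_points):
    return sum(local_points), len(local_points)

def reduce_partial_means(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

shards = {"node-a": [0.2, 0.4], "node-b": [0.6], "node-c": [0.8, 1.0, 1.2]}
partials = [map_partial_mean(points) for points in shards.values()]
print(reduce_partial_means(partials))  # 0.7
```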

There's probably more, but that's what I can think of at the moment.

Paul Dix
  • 1,967
  • 16
  • 8
  • 40
    Prometheus developer here. Paul is right that Prometheus is and will always be float-only (strings are possible in a limited fashion via labels), whereas InfluxDB supports many data types. I'd presume the query languages are fairly similar in power in practice (Prometheus is Turing complete). Our HA approach is to have isolated redundant servers; the alertmanager will dedup alerts from them. We generally take an AP approach to monitoring rather than CP, as it's better to lose a little bit of data than to have your monitoring go down. Prometheus aims to be a system you can rely on in an emergency. – brian-brazil Oct 28 '15 at 20:33
  • 9
    The InfluxDB clustering design is also largely AP, but it aims to be eventually consistent. We achieve that through Hinted Handoff (available in the current release) and Active Anti-Entropy (which we'll start in the 0.9.6 release cycle). Obviously we're not done with clustering yet, but that's the design goal. More details here: https://influxdb.com/blog/2015/06/03/InfluxDB_clustering_design.html – Paul Dix Oct 29 '15 at 16:28
  • 13
    Another Prometheus dev here. Yep, Prometheus itself doesn't aim to be durable long-term storage. But in other ways its scope is bigger, and more about active systems and service monitoring: from client libraries (which don't only speak some metrics output protocol, but help you manage metric primitives such as counters, gauges, histograms, and summaries), through active target discovery / collection of data and dashboarding, all the way to alert computation and notification handling. The query language is also not SQL-like, but works very well for computations on dimensional time series data. – Julius Volz Oct 29 '15 at 17:50
  • 7
    And yes, I have to find time to (re)-evaluate InfluxDB 0.9.5 as a long-term storage candidate for Prometheus - I'm hoping it will fix all/most of the problems I've had with earlier InfluxDB versions in the past regarding disk space, ingestion speed, and query performance. We really want to delegate long-term storage to an external system (like InfluxDB, if it works well) instead of trying to solve that ourselves. – Julius Volz Oct 29 '15 at 17:53
  • 11
    A major design difference between the two means that with Prometheus, [there's no easy way of attaching timestamps other than *now* to imported metrics](https://github.com/prometheus/pushgateway#about-timestamps). This may be a deal breaker if the use case involves a source that can experience delays. InfluxDB [seems to suffer no such limitations](https://docs.influxdata.com/influxdb/v0.13/introduction/getting_started/#writing-and-exploring-data) in this regard. – antak May 13 '16 at 05:16
  • Pretty interesting discussion. I am wondering whether it would now be possible to use an InfluxDB instance as long-term storage for Prometheus. We were using InfluxDB, but my company is forcing us to move towards Prometheus with a retention period of about 2 weeks. For some metrics my team needs a retention period of 2 months, and I'm trying to see if there's a way to directly save those metrics from Prometheus to InfluxDB. – dau_sama Jan 18 '17 at 15:34
  • @PaulDix Could you explain why clustering is only available for the enterprise edition? InfluxDB is an "open source" DB but without clustering it isn't scalable or highly available at all. In fact, the open source version isn't a distributed system so the CAP theorem doesn't even apply. In addition, it appears that backups and restoring are also an enterprise-only feature?? All of this seems very strange to me as it isn't the norm for open source DBs. I'll point to Elasticsearch and MongoDB as excellent and very popular examples of open source DBs with a cloud and/or enterprise offering. – tleef Jan 16 '18 at 19:24
39

We've got the marketing message from the two companies in the other answers. Now let's ignore it and get back to the sad real world of time series data.

Some History

InfluxDB and Prometheus were made to replace old tools from a past era (RRDtool, Graphite).

InfluxDB is a time series database. Prometheus is a sort-of metrics collection and alerting tool, with a storage engine written just for that. (I'm actually not sure you could [or should] reuse the storage engine for something else)

Limitations

Sadly, writing a database is a very complex undertaking. The only way both these tools manage to ship something is by dropping all the hard features relating to high-availability and clustering.

To put it bluntly, each is a single application running on only a single node.

Prometheus has no goal of supporting clustering and replication whatsoever. The official way to support failover is to "run 2 nodes and send data to both of them". Ouch. (Note that this is seriously the ONLY existing way; it's written countless times in the official documentation.)

InfluxDB had been talking about clustering for years... until it was officially abandoned in March. Clustering ain't on the table anymore for InfluxDB. Just forget it. When it is done (supposing it ever is), it will only be available in the Enterprise Edition.

https://influxdata.com/blog/update-on-influxdb-clustering-high-availability-and-monetization/

Within the next few years, we will hopefully have a well-engineered time series database that handles all the hard problems relating to databases: replication, failover, data safety, scalability, backup...

At the moment, there is no silver bullet.

What to do

Evaluate the volume of data to be expected.

100 metrics * 100 sources * 1-second interval => 10,000 datapoints per second => 864 million datapoints per day.

The nice thing about time series databases is that they use a compact format, they compress well, they aggregate datapoints, and they clean up old data. (Plus they come with features relevant to time series data.)

Supposing that a datapoint is stored as 4 bytes, that's only a few gigabytes per day. Lucky for us, there are systems with 10 cores and 10 TB drives readily available. That could probably run on a single node.
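A quick back-of-the-envelope check of those numbers (all inputs are the assumptions above, not measurements):

```python
# Sizing sketch: 100 metrics from 100 sources, sampled every second,
# at an assumed 4 bytes per stored datapoint.
metrics, sources, interval_s = 100, 100, 1
bytes_per_point = 4

points_per_second = metrics * sources / interval_s           # 10,000
points_per_day = points_per_second * 86_400                  # 864,000,000
gigabytes_per_day = points_per_day * bytes_per_point / 1e9   # ~3.5 GB/day
print(int(points_per_day), round(gigabytes_per_day, 1))
```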

The alternative is to use a classic NoSQL database (Cassandra, Elasticsearch, or Riak) and then engineer the missing bits in the application. These databases may not be optimized for that kind of storage (or are they? modern databases are so complex and optimized that you can't know for sure unless you benchmark).

You should evaluate the capacity required by your application. Write a proof of concept with these various databases and measure things.

See if it falls within the limitations of InfluxDB. If so, it's probably the best bet. If not, you'll have to make your own solution on top of something else.

user5994461
  • 5,301
  • 1
  • 36
  • 57
  • 1
    Just FYI: With DalmatinerDB there is already an attempt (?) for a distributed metrics database based on riak_core. But I am not sure how advanced this project is. – SpaceMonkey Jul 19 '16 at 13:57
  • 2
    We use ElasticSearch for storing metrics in production under high load. I can confirm that it's far from ideal for that use case: no built-in retention (we use Elastic's curator on the side), no built-in compression of old data (we run a custom ETL on the side) and no built-in alerting (we run Yelp's ElastAlert on the side). – André Caron May 31 '17 at 16:01
20

InfluxDB simply cannot hold production load (metrics) from 1000 servers. It has some real problems with data ingestion and ends up stalled/hung and unusable. We tried to use it for a while, but once the amount of data reached a critical level it could not be used anymore. No memory or CPU upgrades helped. Therefore, our experience is to definitely avoid it; it's not a mature product and has serious architectural design problems. And I am not even talking about the sudden shift to a commercial model by Influx.

Next we researched Prometheus, and while it required rewriting queries, it now ingests four times more metrics without any problems whatsoever compared to what we tried to feed to Influx. And all that load is handled by a single Prometheus server; it's fast, reliable, and dependable. This is our experience running a huge international internet shop under pretty heavy load.

user3091890
  • 301
  • 3
  • 3
  • 2
    I'm here because we're having similar issues with InfluxDB, particularly memory problems. We have a slightly smaller deployment (100s of servers). Thanks for sharing. – Alexander Torstling Nov 07 '18 at 21:35
  • If you are experiencing oom or high memory usage at InfluxDB, then take a look at VictoriaMetrics - the project I work on. This is a time series database optimized for low resource usage (RAM, CPU, disk space and disk IO). It accepts data in InfluxDB format, so it can be used as InfluxDB replacement. See https://valyala.medium.com/insert-benchmarks-with-inch-influxdb-vs-victoriametrics-e31a41ae2893 – valyala Jun 29 '22 at 12:36
5

IIRC the current Prometheus implementation is designed around all the data fitting on a single server. If you have gigantic quantities of data, it may not all fit in Prometheus.

Travis Bear
  • 13,039
  • 7
  • 42
  • 51
  • Good point! But let's say I will have my stuff on one node and everything will work :) – SpaceMonkey Oct 26 '15 at 16:11
  • 5
    Prometheus developer here, it's possible to scale out Prometheus beyond a single server though rarely needed. We value reliability over consistency as that's what's appropriate for critical monitoring, so avoid clustering. See http://www.robustperception.io/scaling-and-federating-prometheus/ – brian-brazil Oct 28 '15 at 20:15
  • At Weave Cloud we've implemented [a multi-tenant version of Prometheus](https://www.weave.works/features/prometheus-monitoring/), which may be of interest to some of you. – errordeveloper Jun 29 '17 at 11:49