7

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day; I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the full time resolution, so I cannot do things like arrays of 1 point per second.

I plan to store data over many years. I wonder which scheme is the best:

  1. each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my DB
  2. each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.

Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a DB.

Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.

I am favoring option 1, but I might be wrong. Any advice?

Update:

I followed the advice of

https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/

Instead of storing one document per measurement, I now have a document containing 128 measurements plus startDate and nextDate fields. This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
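For reference, a minimal pymongo sketch of such a bucketing scheme, with collection and field names of my own choosing (the article's actual schema may differ; here nextDate tracks the newest timestamp in the bucket rather than the next chunk's start):

```python
from datetime import timedelta
from pymongo import MongoClient, ASCENDING

col = MongoClient().timeseries.sensor_data  # hypothetical names
BUCKET_SIZE = 128

def append_measurement(sensor_id, ts, value):
    # Push into any non-full bucket for this sensor; open a new bucket
    # if none exists. (A real implementation would target the newest bucket.)
    result = col.update_one(
        {"sensorId": sensor_id, "count": {"$lt": BUCKET_SIZE}},
        {"$push": {"measurements": {"ts": ts, "v": value}},
         "$inc": {"count": 1},
         "$min": {"startDate": ts},
         "$max": {"nextDate": ts}},
    )
    if result.matched_count == 0:
        col.insert_one({"sensorId": sensor_id, "count": 1,
                        "startDate": ts, "nextDate": ts,
                        "measurements": [{"ts": ts, "v": value}]})

# One compound index serves the (sensor, day) read path.
col.create_index([("sensorId", ASCENDING), ("startDate", ASCENDING)])

def read_day(sensor_id, day_start):
    day_end = day_start + timedelta(days=1)
    buckets = col.find({"sensorId": sensor_id,
                        "startDate": {"$lt": day_end},
                        "nextDate": {"$gte": day_start}})
    # Unpack buckets and trim points outside the requested day.
    return [m for b in buckets for m in b["measurements"]
            if day_start <= m["ts"] < day_end]
```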

When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up the reads. I currently have about 20,000 collections in my DB, and when I query the list of all collections it takes ages, which makes me think it is not a good idea to have so many collections.

What do you think?

Vince Bowdren
Fred
  • What do you want to do with that data? What precision level interests you for visualisation? (for example, averages by minute/hour/day? Min/max over a time period? ...) – felix Mar 16 '17 at 07:34
  • I did a test a few weeks ago where I was doing aggregation (`$match` then `$count`) for 20 million rows of data and it took > 12 seconds before and ~1.6 seconds after my indexes were set up correctly. If you store multiple records per document, you will need to use compound indexes (which is kind of an art to set up) and aggregation to be able to query efficiently. – Ryan Wheale Mar 22 '17 at 21:55
  • I need the exact timestamps on each data point so no aggregation. – Fred Mar 23 '17 at 07:24

5 Answers

5

I would definitely recommend approach 2, for a number of reasons:

  1. MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data spread across many collections between different servers.
  2. MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
  3. With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you. A minimal setup sketch follows.
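A sketch of that setup with pymongo, assuming a single `readings` collection and `sensorId`/`ts` field names (my choice, not from the answer):

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # connect through mongos

client.admin.command("enableSharding", "timeseries")
# Shard key matching the common read path: all data for one sensor on one day.
# MongoDB creates the supporting index automatically on an empty collection.
client.admin.command("shardCollection", "timeseries.readings",
                     key={"sensorId": 1, "ts": 1})

# A query on the shard key prefix is routed by mongos to a single shard:
readings = client.timeseries.readings
cursor = readings.find({"sensorId": "s42",
                        "ts": {"$gte": datetime(2017, 3, 16),
                               "$lt": datetime(2017, 3, 17)}})
```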
Vince Bowdren
  • So you would actually recommend having just a single collection where each document is (sensorId,TimeStamp,Value) so that I can create a compound index on (sensorId,TimeStamp) and shard on it? If I just create one collection per sensor with (TimeStamp,Value) docs then I cannot shard on (sensorId,TimeStamp). If I just shard on (TimeStamp) I will run into write hotspots, thus negating the benefit of sharding. If I just have a gigantic collection with (sensorId,TimeStamp,Value), I have the impression that sensorId would cost storage space for no reason. What's the best practice? – Fred Mar 24 '17 at 00:36
  • Possibly just a single collection, yes; though as you point out the storage and indexing costs may become significant. So you will be balancing trade-offs, to some degree. There's another interesting post on [storing time-series data in MongoDB](https://www.mongodb.com/blog/post/schema-design-for-time-series-data-in-mongodb) worth reading up on. – Vince Bowdren Mar 24 '17 at 09:40
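To illustrate the exchange above: a tiny sketch of that single-collection layout, using short field names (my choice) to soften the per-document storage overhead Fred raises:

```python
from datetime import datetime
from pymongo import MongoClient, ASCENDING

col = MongoClient().timeseries.readings
col.create_index([("s", ASCENDING), ("t", ASCENDING)])  # sensor, then time

# "s"/"t"/"v" instead of sensorId/TimeStamp/Value: field names are stored
# in every document, so short names cut the constant per-point cost.
col.insert_one({"s": 42, "t": datetime(2017, 3, 16, 7, 34, 0), "v": 3.14})
```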
3

Whilst MongoDB has no limit on collections, I tried an approach similar to 2 but moved away from it to a single collection for all sensor values because it was more manageable.

Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't' (a sketch follows below). Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
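A greedy single-pass sketch of the two reductions described (my own illustration, not the answer's actual code): drop repeated values, and drop midpoints that lie on the line between their neighbours so they can be recovered later by interpolation:

```python
def compress(points, tol=1e-9):
    """points: (t, v) pairs sorted by t, with unique timestamps."""
    out = []
    for t, v in points:
        if out and v == out[-1][1]:
            # Same-value run: store only the change. (For stepped signals
            # you might instead keep the final point of each run.)
            continue
        if len(out) >= 2:
            (t0, v0), (t1, v1) = out[-2], out[-1]
            # If the last kept point sits (within tol) on the straight line
            # from (t0, v0) to (t, v), it can be interpolated later.
            expected = v0 + (v - v0) * (t1 - t0) / (t - t0)
            if abs(v1 - expected) <= tol:
                out.pop()
        out.append((t, v))
    return out
```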

If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.


Ian Mercer
  • I'm only storing differences so each point is mostly a – Fred Mar 16 '17 at 06:33
2

When I query the data I'm only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries.

But that's exactly what Cassandra is good for! See this article and this one.

Really, in one of my projects we were stuck with legacy MongoDB in a scenario similar to yours, except that the amount of new data per day was even lower. We tried changing the data structure, granulating data over multiple MongoDB collections, changing replica set configurations, etc. But we were still disappointed: as the data grew, performance degraded under unpredictable load, and read requests noticeably hurt write response times.

With Cassandra we had fast writes, and the improvement in data retrieval performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always use a Spark (map-reduce) job. Moreover, thinking about the future, Cassandra provides straightforward scalability.
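For concreteness, a sketch of the usual Cassandra time-series layout such articles describe, via the DataStax Python driver (table, keyspace, and host names are illustrative): partitioning by (sensor, day) makes the asker's (date, sensor) query a single-partition read.

```python
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("timeseries")

session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        day date,
        ts timestamp,
        value double,
        PRIMARY KEY ((sensor_id, day), ts)
    )
""")

# All of one sensor's data for one day lives in one partition, stored
# contiguously and sorted by ts, so this read hits a single replica set:
rows = session.execute(
    "SELECT ts, value FROM readings WHERE sensor_id = %s AND day = %s",
    ("s42", date(2017, 3, 16)),
)
```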

I believe sticking with a legacy technology is fine as long as it suits the job; if not, it's more effective to change the technology stack.

S. Stas
1

If I understand right, you plan to create collections on the fly, i.e. at 12 AM each day you will have new collections. I guess MongoDB is a wrong choice for this: there is no way to query documents across collections in MongoDB, so you would have to write a complex mechanism to retrieve data. In my opinion, you should consider Elasticsearch instead, where you can create indices (collections) like sensor-data-s1-3-14-2017 and do a wildcard search across indices (for example: sensor-data-s1-* or sensor-data-*), as sketched below. See here for wildcard search.
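A short sketch of that daily-index pattern with the official elasticsearch Python client (the document fields are my own illustration; the index names follow the answer's example):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# One index per sensor per day:
es.index(index="sensor-data-s1-3-14-2017",
         body={"ts": "2017-03-14T07:34:00.123Z", "value": 3.14})

# Wildcard search across all of sensor s1's daily indices:
hits = es.search(index="sensor-data-s1-*",
                 body={"query": {"match_all": {}}})
```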

If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so that you get optimal performance which does not degrade over time.

titogeo
  • Hi, I don't need to store documents across collections, and I'd like to keep using MongoDB for legacy reasons. – Fred Mar 16 '17 at 00:36
0

Approach #1 is not a good idea; the key to speeding things up is divide (shard) and rule. What if the number of signals itself reaches 100,000?

So place each signal in one collection and shard the signals over nodes to speed up reads. Multiple collections (signals) can live on the same node.

How this will help

  • Signal processing usually works on a time span, e.g. processing a signal over 3 days; in that case you can read from 3 nodes in parallel for that signal and do parallel Apache Spark processing.

  • Cross-signal processing: most signal processing algorithms use the same period of 2 or more signals for analysis (e.g. cross-correlation); since those signals are fetched in parallel this is also fast, and pre-processing of the individual signals can be parallelized (a rough illustration follows the list).
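A rough illustration of the parallel per-day read, using plain threads and pymongo in place of Spark (all names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from pymongo import MongoClient

col = MongoClient("mongodb://mongos-host:27017").timeseries.signal_s1

def fetch_day(day):
    # Each day's range may live on a different shard, so these reads
    # can proceed in parallel against separate nodes.
    return list(col.find({"ts": {"$gte": day,
                                 "$lt": day + timedelta(days=1)}}))

days = [datetime(2017, 3, 14) + timedelta(days=i) for i in range(3)]
with ThreadPoolExecutor(max_workers=len(days)) as pool:
    per_day = list(pool.map(fetch_day, days))
```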

SACn