28

I'm trying to use MongoDB as a time series database and was wondering if anyone could suggest how best to set it up for that scenario.

The time series data is very similar to a stock price history. I have a collection of data from a variety of sensors taken from different machines. There are values at billions of timestamps, and I would like to ask the following questions (preferably from the database rather than at the application level):

  1. For a given set of sensors and time interval, I want all the timestamps and sensor values that lie within that interval in order by time. Assume all the sensors share the same timestamps (they were all sampled at the same time).

  2. For a given set of sensors and time interval, I want every kth item (timestamp and corresponding sensor values) that lies within the given interval, in order by time.

Any recommendation on how to best set this up and achieve the queries?

Thanks for the suggestions.

sequoia

3 Answers

21

Obviously this is an old question, but I came across it when I was researching MongoDB for time series data. I thought that it might be worth sharing the following approach for allocating complete documents in advance and performing update operations, as opposed to new insert operations. Note that this approach was documented here and here.

Imagine you are storing data every minute. Consider the following document structure:

{
  timestamp: ISODate("2013-10-10T23:06:00.000Z"),
  type: "spot_EURUSD",
  value: 1.2345
},
{
  timestamp: ISODate("2013-10-10T23:07:00.000Z"),
  type: "spot_EURUSD",
  value: 1.2346
}

This is comparable to a standard relational approach. In this case, you produce one document per value recorded, which causes a lot of insert operations. We can do better. Consider the following:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "spot_EURUSD",
  values: {
    0: 1.2345,
    …
    37: 1.2346,
    38: 1.2347,
    …
    59: 1.2343
  }
}

Now we can write one document per hour, keyed by the minute within the hour, and perform 59 in-place updates rather than 60 separate inserts. This is much better because the updates are atomic, the individual writes are smaller, and there are other performance and concurrency benefits. But what if we wanted to store the entire day, and not just a single hour, in one document? With a flat layout that would require walking through up to 1440 entries to get to the last value. To improve on this, we can extend the structure further to the following:

{
  timestamp_day: ISODate("2013-10-10T00:00:00.000Z"),
  type: "spot_EURUSD",
  values: {
    0: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343},
    1: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343},
    …,
    22: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343},
    23: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343}
  }
}

Using this nested approach, we now only have to walk across at most 24 + 60 keys (first the hour, then the minute within that hour) to reach the very last value of the day.
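
As an aside, reading the final value of the day back out of such a nested document is a one-liner in the shell; a minimal sketch, assuming the collection is called prices (the name is not from the original answer):

// Fetch the pre-allocated day document for this instrument.
var doc = db.prices.findOne({
  timestamp_day: ISODate("2013-10-10T00:00:00.000Z"),
  type: "spot_EURUSD"
});

// Walk hour 23, then minute 59, to reach the last value of the day.
var lastValue = doc.values["23"]["59"];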

If we build the documents in advance, with all the values filled in with padding, we can be sure that the document will not change size and will therefore not be moved on disk.
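
A rough sketch of that pre-allocation, and of the per-minute in-place update that follows, might look like this in the shell (the collection name prices and the null padding value are assumptions):

// Pre-allocate a fully padded day document: 24 hours x 60 minutes, every
// slot initialised to null so the document never grows afterwards.
var values = {};
for (var h = 0; h < 24; h++) {
  values[h] = {};
  for (var m = 0; m < 60; m++) {
    values[h][m] = null;
  }
}

db.prices.insert({
  timestamp_day: ISODate("2013-10-10T00:00:00.000Z"),
  type: "spot_EURUSD",
  values: values
});

// Each subsequent reading is then a small in-place update of one slot,
// e.g. the value recorded at 23:37 on that day:
db.prices.update(
  { timestamp_day: ISODate("2013-10-10T00:00:00.000Z"), type: "spot_EURUSD" },
  { $set: { "values.23.37": 1.2346 } }
);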

jtromans
  • The data is stored in UTC, how would you query the data in your last example for a user in their time zone? For example, query the data between a date range in a time zone other than UTC. – ColinMc Feb 19 '14 at 17:52
  • It is sensible to store data in a universal timezone (like UTC). To make a range search based on a different timezone, I would use moment.js to do the conversion from the local time into UTC. – jtromans Feb 20 '14 at 08:34
  • Also, see this link: http://www.mongodb.com/presentations/webinar-how-banks-use-mongodb-tick-database – jtromans Nov 05 '14 at 11:46
  • The final solution is this? `{ timestamp: ISODate("2013-10-10T23:00:00.000Z"), type: "spot_EURUSD", values: { 0: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343}, 1: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343}, …, 22: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343}, 23: { 0: 1.2343, 1: 1.2343, …, 59: 1.2343} } }`? Or do I need to use one document per minute/hour…? – Paulo Coutinho Aug 21 '15 at 02:36
  • Hi Paulo, it completely depends on your use case. If you have a use case where you need to pull out individual minutes, or seconds, then you should continue 'binning' your data accordingly down to the finest resolution necessary to satisfy the criteria. – jtromans Aug 21 '15 at 09:24
  • Is this still the recommended approach after WiredTiger became available? – Dan Dascalescu Sep 09 '16 at 07:08
  • It's a good question. I haven't tried it with WiredTiger version since we are storing the data in a completely different way now and use a different NoSQL database instead. However, I would encourage you to try it yourself and benchmark. For sure this is still a very convenient way to index and retrieve data. – jtromans Sep 10 '16 at 13:33
  • This is the actual source to this approach: http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb – st.huber Nov 21 '16 at 14:44
  • @DanDascalescu this link seems to suggest the performance is similar : http://learnmongodbthehardway.com/schema/timeseries/ – Jonathan Hartnagel May 23 '18 at 13:30
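
Picking up on the timezone discussion in the comments above, one way that moment.js conversion could look (the moment-timezone add-on, the zone name and the query field are illustrative assumptions, not part of the answer):

// Convert the user's local day boundaries into UTC Date objects, then use
// them as the bounds of the MongoDB range query.
var moment = require("moment-timezone");

var start = moment.tz("2013-10-10T00:00:00", "America/New_York").toDate();
var end   = moment.tz("2013-10-11T00:00:00", "America/New_York").toDate();

// e.g. db.prices.find({ timestamp_day: { $gte: start, $lt: end } })
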
14

If you don't need to keep the data forever (i.e. you don't mind it 'ageing out'), you may want to consider a 'capped collection'. Capped collections have a number of restrictions that in turn provide some interesting benefits, which sound like they fit what you want quite well.

Basically, a capped collection has a specified size, and documents are written to it in insertion order until it fills up, at which point it wraps around and begins overwriting the oldest documents with the newest. You are slightly limited in what updates you can perform on the documents in a capped collection, i.e. you cannot perform an update that will change the size of the document (as this would mean it would need to be moved on disk to find the extra space). I can't see this being a problem for what you describe.
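
For reference, a capped collection has to be created explicitly with its size fixed up front; a minimal sketch (the collection name and the 1 GB size are placeholders):

// Create a capped collection of roughly 1 GB. Once it fills up, the oldest
// documents are overwritten by new inserts, preserving insertion order.
db.createCollection("sensorReadings", { capped: true, size: 1024 * 1024 * 1024 })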

The upshot is that you are guaranteed that the data in your capped collection will be written to, and will stay on, disk in insertion order, which makes queries on insertion order very fast.

How different are the sensors and the data they produce, by the way? If they're relatively similar I would suggest storing them all in the same collection for ease of use - otherwise split them up.

Assuming you use a single collection, both your queries then sound very doable. One thing to bear in mind would be that to get the benefit of the capped collection you would need to be querying according to the collection's 'natural' order, so querying by your timestamp key would not be as fast. If the readings are taken at regular intervals (so you know how many of them would be taken in a given time interval) I would suggest something like the following for query 1:

db.myCollection.find().limit(100000).sort({ $natural : -1 })

Assuming, for example, that you store 100 readings a second, the above will return the last 100 seconds' worth of data. If you wanted the previous 100 seconds you could add .skip(100000).
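
Spelled out, that previous window would look something like the following (still assuming 100 readings per second; the shell applies sort, then skip, then limit regardless of the order in which they are chained):

// Skip the newest 100,000 readings, then take the 100,000 before them.
db.myCollection.find().sort({ $natural : -1 }).skip(100000).limit(100000)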

For your second query, it sounds to me like you'll need MapReduce, but it doesn't sound particularly difficult. You can select the range of documents you're interested in with a similar query to the one above, then pick out only the ones at the intervals you're interested in with the map function.
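
As a very rough sketch of that idea, assuming each reading carries an increasing sequence number seq written at insert time (the field, the collection name readings and k are illustrative, not from the question):

var k = 10;  // keep every 10th reading

db.readings.mapReduce(
  function () {
    // map: emit only every k-th reading within the selected range
    if (this.seq % k === 0) {
      emit(this.timestamp, this.value);
    }
  },
  function (key, values) {
    // reduce: at most one value per timestamp, so pass it straight through
    return values[0];
  },
  {
    query: { timestamp: { $gte: ISODate("2011-09-01T00:00:00Z"),
                          $lt:  ISODate("2011-09-02T00:00:00Z") } },
    scope: { k: k },        // make k visible inside the map function
    out: { inline: 1 }      // return results directly rather than to a collection
  }
);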

Here are the Mongo docs on capped collections: http://www.mongodb.org/display/DOCS/Capped+Collections

Hope this helps!

Russell
  • Thanks Russell. We do want to store the data forever; that is one of the reasons MongoDB is being considered, for its scalability. The sensors are relatively similar, but some of them have different sample rates and are recorded by different machines. This means that there is a slight chance of misalignment in the timestamps, so the different sensors will probably be stored in separate collections. I am curious to know whether MongoDB naturally stores results in the order they were recorded? I am trying to get an idea of how snappy queries for particular time ranges will be once we have a lot of data. – sequoia Sep 12 '11 at 16:50
  • (Start with the comment above) And thanks for directing me to Map Reduce. With a first glance at the data, the full resolution is not required, but if need be you could zoom in. The data will be plotted using flot. Either the Max or Average over a window or perhaps just every kth sample will be plotted. Finding the max or average might be very slow, so I wanted to get a sense of how fast queries are to see if this is possible or whether it would take too much time. Any idea if the data is stored in order? if returning results in order of timestamp instead of id can be done quickly? – sequoia Sep 12 '11 at 16:56
  • Storing of data in insertion order is only guaranteed for capped collections I'm afraid. However it sounds like the data you'll be storing will be write once, never update, which *might* end up being stored in insertion order by default - although I would imagine you'd end up with a block of one collection, then a block of another etc etc. As long as you have an index on the timestamp column your queries should be relatively fast. Bear in mind though that MapReduce is a 'slow' operation - its strength is that it scales reliably (and horizontally). Hope this helps! – Russell Sep 16 '11 at 08:37
  • Thanks. I'll just index the time stamps and tolerate the speed of map reduce. I also intend to look into the "New Aggregation Framework" in place of Map reduce. – sequoia Sep 20 '11 at 23:55
0

I know that this is an old question, but I found these blogs that helped me a lot:

StPaulis