
I am thinking of redeveloping an application that stores and visualises sensor data. The application is accessible by multiple users, and they can add an unlimited number of sensors. We are 10 users right now and we have about 1,000 sensors; the number of users will probably not change. The sensors send data every 5 seconds if there is enough power (solar panels).

Right now data is stored in 4 tables.

  • User [id, email, password, etc..]
  • Device [id, name, user_id] user_id: foreign key
  • Sensor [id, device_id, type] device_id: foreign key
  • Data [id, sensor_id, time, data] sensor_id: foreign key

The problem is that the Data table gets extremely big. The goal is to retain data for almost a year. I use MySQL and I am very disappointed with its performance. Right now I am using Flask with Gunicorn, and RabbitMQ to queue the storing procedure. Is there anything I can change to increase the performance of the existing system? If you started this from scratch, what changes would you make? Will NoSQL make a big difference in this situation? I am asking a lot, but it's the first time I have faced this kind of problem.

kechap
  • What is it about the performance that is unacceptable? Can you be more specific? Without knowing what operation you are performing that is executing too slowly, I don't see how this is answerable. – Robert Harvey Aug 29 '13 at 20:27
  • The problem appears when I try to select data over a period of time. The user just sits there and waits for the query to execute. I am talking about 30 seconds; sometimes it hits the timeout, which is set at 45 seconds. – kechap Aug 29 '13 at 20:37
  • Did you index your `time` column? – Robert Harvey Aug 29 '13 at 20:38
  • No I didn't. I'll try it. Hopefully it will improve the situation. – kechap Aug 29 '13 at 20:43
  • You should see a dramatic improvement. – Robert Harvey Aug 29 '13 at 20:47
  • 10-12 sec improvement. Not bad. Anything else I can do? The table is already optimized. – kechap Aug 29 '13 at 21:07
  • How many records did you get back? How long is it actually taking now? Can you add your query code to the question? – Robert Harvey Aug 29 '13 at 21:11
  • I am using a simple SELECT between a from and a to datetime, nothing special. I tried it for 10 queries selecting 1k to 10k rows. Your comment led me to this: http://stackoverflow.com/questions/4594229/mysql-integer-vs-datetime-index. At least I have something to try, though that will take a while. – kechap Aug 29 '13 at 21:18
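
To make the fix discussed in the comments above concrete, here is a minimal sketch, assuming MySQL and the mysql-connector-python driver, that adds a composite (sensor_id, time) index to the Data table and runs the kind of from-to query described; connection details and values are placeholders.

```python
import mysql.connector

# Illustrative connection settings; table and column names follow the question.
conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="sensors"
)
cur = conn.cursor()

# A composite index lets MySQL jump straight to one sensor's time range
# instead of scanning the whole Data table.
cur.execute("CREATE INDEX idx_sensor_time ON Data (sensor_id, time)")

# The typical range query that the index serves.
cur.execute(
    "SELECT time, data FROM Data "
    "WHERE sensor_id = %s AND time BETWEEN %s AND %s "
    "ORDER BY time",
    (42, "2013-08-01 00:00:00", "2013-08-08 00:00:00"),
)
rows = cur.fetchall()
```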

4 Answers

  1. Since you have 1k sensors and each generates data every 5 seconds, this looks to me like a good case for a framework like Akka, which processes many requests while avoiding the usual many-threads issues.

  2. Once your processing stage is optimized, you are right to look at NoSQL. The comments mentioned the missing index, but since everything sits in a single table, every insert also has to maintain the index over all of that data, and this can kill your app's write throughput.

    You have a lot of options for solving this. Partition the tables so that the latest one holds only the newest data, or use two tables: one for reading and queries, and a second for writing, with periodic bulk inserts from the write table into the read table; this is fast because the write table keeps only trimmed-down indexes (see the sketch after this list). It is a well-known trade-off: you can optimize your storage for massive reading or massive writing, but not for both.

    Or you can look at NoSQL; Redis in particular comes to mind. Have a look at its data types: http://redis.io/topics/data-types-intro

    Redis supports long lists natively. Since it does not support querying in the sense of SELECT ... FROM ... WHERE ..., you would have to provide your own indexes and caches for the queries you need (also sketched below). If you are interested in how to use a key-value store this way, look at their Twitter demo; Twitter has to solve the same kind of problems you do.
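
To make the two-table idea above concrete, here is a minimal sketch, assuming MySQL and the mysql-connector-python driver, with hypothetical data_recent (write) and data_archive (read) tables; the names and the roll-over schedule are placeholders.

```python
import mysql.connector

# data_recent takes the incoming writes with minimal indexes;
# data_archive is fully indexed and serves the user-facing queries.
conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="sensors"
)
cur = conn.cursor()

def roll_over(cutoff):
    """Bulk-move rows older than `cutoff` from the write table to the read table."""
    cur.execute(
        "INSERT INTO data_archive (sensor_id, time, data) "
        "SELECT sensor_id, time, data FROM data_recent WHERE time < %s",
        (cutoff,),
    )
    cur.execute("DELETE FROM data_recent WHERE time < %s", (cutoff,))
    conn.commit()

# Run periodically, e.g. once per hour from a cron job.
roll_over("2013-08-29 00:00:00")
```

And a minimal sketch of the Redis route, assuming the redis-py client: one sorted set per sensor, scored by the Unix timestamp, acts as the hand-rolled "index" that makes time-range queries possible (key names are illustrative).

```python
import json
import time

import redis

r = redis.Redis()  # assumes a local Redis server

def store_reading(sensor_id, value, ts=None):
    """Append one reading; the timestamp doubles as the sort key."""
    ts = time.time() if ts is None else ts
    member = json.dumps({"t": ts, "v": value})
    r.zadd("sensor:%d" % sensor_id, {member: ts})

def readings_between(sensor_id, start_ts, end_ts):
    """Time-range query via ZRANGEBYSCORE, i.e. the querying you build yourself."""
    raw = r.zrangebyscore("sensor:%d" % sensor_id, start_ts, end_ts)
    return [json.loads(m) for m in raw]
```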

This brings me to my last point. If you want better scalability and do not know how to get it, have a look at the Facebook, Twitter, or Netflix architectures.

Martin Podval
  • 1. It's not a problem yet; it just works. 2. What about MongoDB? It seems not that scary. I already have trouble structuring my data with MongoDB, and Redis looks like a big jump for me. – kechap Aug 30 '13 at 11:37
  • Mongo is designed to handle documents like JSON. It does not offer SQL-style querying, but you can define an index on a given JSON property. I'm not that familiar with Mongo, as I haven't used it in a real product. The key question when choosing a database is what you want to do with the data: do you need to run queries over it, or do you want to work with large arrays (millions of items) or large hash sets? That is why I recommended reading the Redis data types topic; maybe you will find that those structures are right for you. – Martin Podval Aug 30 '13 at 19:20
  • Thanks for pointing me to Redis. I found this: http://strataconf.com/strata2013/public/schedule/detail/27350 and I think this is how I should proceed. – kechap Sep 05 '13 at 08:22

No discussion about telemetry data would be complete without discussing solutions that have already been proven in industry.

HDF5 is one such solution. HDF5 is a data model, library, and file format for storing and managing telemetry data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
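
For illustration, here is a minimal sketch of storing readings in HDF5 with the h5py and NumPy libraries, using one extendable dataset per sensor; the file, dataset, and field names are assumptions.

```python
import numpy as np
import h5py

reading_dtype = np.dtype([("time", "f8"), ("value", "f8")])

with h5py.File("telemetry.h5", "a") as f:
    # One unlimited, chunked dataset per sensor.
    if "sensor_42" in f:
        dset = f["sensor_42"]
    else:
        dset = f.create_dataset(
            "sensor_42", shape=(0,), maxshape=(None,),
            dtype=reading_dtype, chunks=True,
        )

    # Append a batch of readings by growing the dataset along the time axis.
    batch = np.array([(1377810000.0, 23.5), (1377810005.0, 23.7)], dtype=reading_dtype)
    dset.resize(dset.shape[0] + len(batch), axis=0)
    dset[-len(batch):] = batch

    # Reads are plain array slices, e.g. the last 1000 readings.
    recent = dset[-1000:]
```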

SQL Server has a FILESTREAM data type that is uniquely suited to handling large telemetry data sets. McLaren Electronic Systems uses it to collect telemetry data from Formula One race cars.

Further Reading
Programming with FileStream
McLaren Case Study

Robert Harvey

The database landscape has changed a lot since you asked, but the question remains valid, perhaps even more so today. To summarize, your needs seem to be the following:

  1. Storing large amount of sensor data / time series data.
  2. Be able to query data fast.
  3. Easily/automatically drop data after a given period: data retention policy

It looks like you need a database optimized for sensor/IoT/time series data. According to DB-Engines, time series databases have gained the most traction over the last two years. I think it is worth trying these databases, as they are optimized for exactly this kind of data. Some notable ones to look at:

  1. InfluxDB, an open-source time series database for sensors and systems.
  2. GridDB, an open-source NoSQL time series database for IoT and big data.
  3. CrateDB, distributed SQL for industrial IoT.

These databases are all designed to store time series/IoT data with fast ingest and fast queries, and they have built-in data retention functions.
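
As an illustration of the ingest/query/retention workflow, here is a minimal sketch using the InfluxDB 1.x Python client (the influxdb package); the host, database, measurement, and retention settings are assumptions.

```python
from influxdb import InfluxDBClient

# Assumes a local InfluxDB 1.x server.
client = InfluxDBClient(host="localhost", port=8086, database="sensors")
client.create_database("sensors")

# Keep roughly one year of data, then let the database drop it automatically.
client.create_retention_policy("one_year", "365d", "1", database="sensors", default=True)

# One point per reading: the sensor id is a tag, the reading is a field.
client.write_points([{
    "measurement": "readings",
    "tags": {"sensor_id": "42"},
    "fields": {"value": 23.5},
}])

# Range query for a single sensor over a time window.
result = client.query(
    "SELECT value FROM readings "
    "WHERE sensor_id = '42' AND time >= '2013-08-01' AND time < '2013-09-01'"
)
```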

For example, with GridDB your data model would be something like this:

  • A users container with fields user_id, email, password, etc.
  • A devices container per user, with key "DEVICES_$userID" and fields deviceId and a sensorIDs array.
  • A sensor container per sensor, with key "SENSOR_$sensorID" and fields time and data.

Containers work like tables, but they partition the data without any tricks, so you can quickly query the data of an individual sensor or a set of sensors. Because each sensor's data is stored in its own container, you never end up with one bloated table.

Codelicious

As Martin Podval said, you should look at NoSQL, but then again you can try a few tricks first. Start by partitioning your data into multiple tables: depending on the time range that is queried most often, you can create one table per week or one table per month. For longer time ranges you will then have to query several tables and combine the results (a small map-and-reduce job, sketched below), but multiple queries on smaller tables will prove faster than a single query on one large table.
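
A minimal sketch of that approach, assuming hypothetical monthly tables named data_YYYY_MM and a MySQL-style DB-API cursor; the naming scheme and the final sort step are placeholders.

```python
from datetime import date

def month_tables(start, end):
    """Names of the per-month tables covering [start, end], e.g. data_2013_08."""
    tables, y, m = [], start.year, start.month
    while (y, m) <= (end.year, end.month):
        tables.append("data_%04d_%02d" % (y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return tables

def read_range(cur, sensor_id, start, end):
    """Run the same range query against each monthly table and merge the results."""
    rows = []
    for table in month_tables(start, end):
        cur.execute(
            "SELECT time, data FROM %s "
            "WHERE sensor_id = %%s AND time BETWEEN %%s AND %%s" % table,
            (sensor_id, start, end),
        )
        rows.extend(cur.fetchall())
    # The small "reduce" step: order by time across all tables.
    return sorted(rows, key=lambda r: r[0])

# Example: read_range(cur, 42, date(2013, 6, 1), date(2013, 8, 29))
```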

The second trick is to optimize the indexing on those tables, and to avoid JOIN operations at all costs.

Lastly, you can add caching. It is a really old trick and it is debated a lot, but with 10 users looking at 1,000 sensors over a year, I think there is a good chance they request the same data more than once (see the sketch below).
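
For instance, a minimal in-memory cache keyed by (sensor_id, start, end) with a short time-to-live; the table name and TTL are placeholders, and in production you might use Redis or memcached instead.

```python
import time

class QueryCache:
    """Tiny in-memory cache for range-query results, with a time-to-live."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.entries = {}

    def get(self, key):
        hit = self.entries.get(key)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]
        return None

    def put(self, key, rows):
        self.entries[key] = (time.time(), rows)

cache = QueryCache()

def cached_read(cur, sensor_id, start, end):
    """Serve repeated range queries from memory instead of hitting the Data table."""
    key = (sensor_id, start, end)
    rows = cache.get(key)
    if rows is None:
        cur.execute(
            "SELECT time, data FROM Data "
            "WHERE sensor_id = %s AND time BETWEEN %s AND %s",
            (sensor_id, start, end),
        )
        rows = cur.fetchall()
        cache.put(key, rows)
    return rows
```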

I think the best solution is not just a NoSQL database but something distributed; even with cheap servers you will get better performance. Doing the math, you end up with about 6.3 billion records for one year (1,000 sensors × 17,280 readings per day × 365 days). No matter how fast a computer is and what storage system it uses, it takes a long time to read that much data, even from memory.

lcornea