I'm gathering event logs every time a property of some device is changed. For this purpose I decided to use:
- Logstash - which my IoT agent application sends logs to in JSON format,
- Elasticsearch - for storing data (logs),
- Kibana - for data visualisation.
The JSON with logs is sent at regular intervals and has the following form:
{"deviceEventLogs":[{"date":"16:16:39 31-08-2016","locationName":"default","property":"on","device":"Lamp 1","value":"false","roomName":"LivingRoom"}, ... ]}
An example of a single event entry in Elasticsearch looks as follows:
{
    "_index": "logstash-2016.08.25",
    "_type": "on",
    "_id": "AVbDYQPq54WlAl_UD_yg",
    "_score": 1,
    "_source": {
        "@version": "1",
        "@timestamp": "2016-08-25T20:25:28.750Z",
        "host": "127.0.0.1",
        "headers": {
            "request_method": "PUT",
            "request_path": "/deviceEventLogs",
            "request_uri": "/deviceEventLogs",
            "http_version": "HTTP/1.1",
            "content_type": "application/json",
            "http_user_agent": "Java/1.8.0_91",
            "http_host": "127.0.0.1:31311",
            "http_accept": "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2",
            "http_connection": "keep-alive",
            "content_length": "34861"
        },
        "date": "2016-08-08T14:48:11.000Z",
        "device": "Lamp 1",
        "property": "on",
        "locationName": "default",
        "roomName": "LivingRoom",
        "value_boolean": true
    }
}
My goal is to create a website with some kind of dashboard showing the analyzed data in reasonable time (several minutes is acceptable), e.g.:
- showing the history of energy consumption and predicting future consumption
- detecting anomalies in energy consumption or in other factors like lights or heating usage
- showing recommendations based on fairly unsophisticated statistics, e.g. "you can move a given device from location1 to location2 because it's needed more there (used more intensively than elsewhere)", etc.
While the last point is quite trivial - a simple query or aggregation in Elasticsearch compared against some threshold value would do - the first two points require in-depth analysis such as machine learning or data mining.
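To illustrate the "simple aggregation plus threshold" idea, here is a minimal sketch. The field names (`device.keyword`, `value_boolean`), the sub-aggregation (average of the boolean "on" value as a usage ratio), and the threshold are my assumptions for illustration, not part of the actual mapping; the response dict below only mimics the shape Elasticsearch returns for a terms aggregation with a sub-aggregation.

```python
# Hypothetical aggregation request body: group events per device and
# compute the average of value_boolean (fraction of "on" events).
QUERY = {
    "size": 0,
    "aggs": {
        "per_device": {
            "terms": {"field": "device.keyword"},
            "aggs": {
                "on_ratio": {"avg": {"field": "value_boolean"}}
            }
        }
    }
}

def heavily_used_devices(response, threshold=0.5):
    """Return names of devices whose 'on' ratio exceeds the threshold."""
    buckets = response["aggregations"]["per_device"]["buckets"]
    return [b["key"] for b in buckets if b["on_ratio"]["value"] > threshold]

# Mocked response in the shape Elasticsearch returns for such a query:
sample_response = {
    "aggregations": {
        "per_device": {
            "buckets": [
                {"key": "Lamp 1", "doc_count": 840, "on_ratio": {"value": 0.72}},
                {"key": "Lamp 2", "doc_count": 812, "on_ratio": {"value": 0.11}},
            ]
        }
    }
}

print(heavily_used_devices(sample_response))  # -> ['Lamp 1']
```

Comparing such per-device ratios between two locations is essentially all the "move device from location1 to location2" recommendation needs.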
For now the system is equipped with around 50 devices, each updating its status every 10 seconds on average. In the future the number of devices may grow to 50,000. Assuming 100 bytes per event log, that works out to roughly 15 terabytes of data in Elasticsearch per year.
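The back-of-the-envelope arithmetic behind that figure:

```python
# Raw-payload estimate for the 50,000-device scenario; note that
# Elasticsearch index overhead and replicas would multiply this.
devices = 50_000
interval_s = 10            # one event per device every 10 seconds
bytes_per_event = 100
seconds_per_year = 365 * 24 * 3600

events_per_year = devices * seconds_per_year / interval_s
terabytes = events_per_year * bytes_per_event / 1e12
print(round(terabytes, 1))  # -> 15.8
```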
The general question is: what would be a reasonable solution / technology / architecture for such a system?
- Is it a reasonable start to store all my logs in Elasticsearch?
- I'm considering the es-hadoop library to use Elasticsearch together with Apache Spark, so that I can process my data with MLlib - is that a reasonable direction to take?
- Can I use only Elasticsearch to store all my data and just use Spark and MLlib for the in-depth analysis, or should I consider implementing the so-called "Lambda Architecture", treating Elasticsearch as the speed layer? I've read a bit about various setups where Kafka and Apache Storm were used, but I'm not sure I need them. Since the project has to be done within one month and I'm a beginner, I'm worried about the complexity, and hence the time, such an implementation would require.
- What if the data load were 10x smaller (around 1.5 terabytes per year) - would your answer be the same?
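Regarding the Spark/MLlib bullet: to make the anomaly-detection goal concrete, here is a plain-Python sketch of one of the simplest approaches (flagging readings more than k standard deviations from the mean). The readings are made-up illustration data; at scale the same logic maps onto Spark over data read from Elasticsearch via es-hadoop, or onto richer MLlib models such as clustering over feature vectors.

```python
# Naive z-score anomaly detection: a sketch of the kind of computation
# that would run in Spark/MLlib over the full event history.
from statistics import mean, stdev

def anomalies(readings, k=2.0):
    """Return readings more than k standard deviations from the mean."""
    mu, sigma = mean(readings), stdev(readings)
    return [x for x in readings if abs(x - mu) > k * sigma]

# Hypothetical power readings (watts) for one device; one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 45.0, 10.1]
print(anomalies(readings))  # -> [45.0]
```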