
Scenario: I have a service that logs events, as in this CSV example:

#TimeStamp, Name, ColorOfPullover
TimeStamp01, Peter, Green
TimeStamp02, Bob, Blue
TimeStamp03, Peter, Green
TimeStamp04, Peter, Red
TimeStamp05, Peter, Green

Events like "Peter wears Green" will often occur many times in a row.

I have two goals:

  1. Keep the data as small as possible
  2. Keep all the relevant data

Relevant means: I need to know in which time spans a person was wearing which color. E.g.:

#StartTime, EndTime, Name, ColorOfPullover
TimeStamp01, TimeStamp03, Peter, Green
TimeStamp02, TimeStamp02, Bob, Blue
TimeStamp04, TimeStamp04, Peter, Red
TimeStamp05, TimeStamp05, Peter, Green

In this format, I can answer questions like: which color was Peter wearing at TimeStamp02? (I can safely assume that each person wears the same color between two logged events of the same color.)
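For illustration, a minimal lookup over such an interval table could look like this. This is a sketch over an in-memory list; the tuple layout and the hard-coded sample rows are assumptions for the example, not part of the question:

# Hypothetical in-memory version of the interval table above.
intervals = [
    # (start, end, name, color)
    ("TimeStamp01", "TimeStamp03", "Peter", "Green"),
    ("TimeStamp02", "TimeStamp02", "Bob", "Blue"),
    ("TimeStamp04", "TimeStamp04", "Peter", "Red"),
    ("TimeStamp05", "TimeStamp05", "Peter", "Green"),
]

def color_at(name, timestamp):
    # Relies on the timestamp labels sorting lexicographically, as in the example.
    for start, end, person, color in intervals:
        if person == name and start <= timestamp <= end:
            return color
    return None

print(color_at("Peter", "TimeStamp02"))  # -> "Green"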

Main question: Can I use an existing technology to accomplish this? I.e., one that I can feed a continuous stream of events and that extracts and stores the relevant data?


To be precise, I need to implement an algorithm like the following pseudocode. The OnNewEvent method is called for each line of the CSV example, where the parameter event already contains the data from the line as member variables.

def OnNewEvent(event):
    entry = Database.getLatestEntryFor(event.personName)
    if entry is not None and entry.pulloverColor == event.pulloverColor:
        # Same color as the latest entry: extend the current interval.
        entry.setIntervalEndDate(event.date)
        Database.store(entry)
    else:
        # First event for this person, or the color changed: open a new interval.
        newEntry = Entry()
        newEntry.setIntervalStartDate(event.date)
        newEntry.setIntervalEndDate(event.date)
        newEntry.setPulloverColor(event.pulloverColor)
        newEntry.setName(event.personName)
        Database.createNewEntry(newEntry)
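For reference, a self-contained Python version of the same idea, with a plain list and dict standing in for the database; the inlined CSV sample and all names here are just for the demo:

import csv, io

SAMPLE = """TimeStamp01,Peter,Green
TimeStamp02,Bob,Blue
TimeStamp03,Peter,Green
TimeStamp04,Peter,Red
TimeStamp05,Peter,Green"""

entries = []   # all intervals, as [start, end, name, color], in insertion order
latest = {}    # person name -> index of that person's latest entry

def on_new_event(date, name, color):
    i = latest.get(name)
    if i is not None and entries[i][3] == color:
        # Same color as the latest entry: just move the interval end forward.
        entries[i][1] = date
    else:
        # First event for this person, or a color change: open a new interval.
        latest[name] = len(entries)
        entries.append([date, date, name, color])

for row in csv.reader(io.StringIO(SAMPLE)):
    on_new_event(*[field.strip() for field in row])

for start, end, name, color in entries:
    print(start, end, name, color)
# Prints the four intervals shown in the desired output above.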
fex
  • It should be possible to do it with logstash, but the problem is that you'll have to do an elasticsearch request for each line to retrieve the latest entry, which will make the process very slow. That's why I don't think that logstash is the right tool for this. – baudsp Sep 21 '17 at 16:16
  • What are your data volumes, and how quickly do you need to react when a new event occurs? Is it ok if some events are lost? – Oleg Kuralenko Sep 27 '17 at 19:07
  • Reaction to events may be slow. E.g. a 1-day delay is acceptable, so a cron job once a day could be an option. Events must not be lost, that is mission critical. – fex Sep 28 '17 at 11:05

2 Answers

This is a typical scenario for any streaming architecture.

There are multiple existing technologies that work together to achieve what you want:


1.  A NoSQL database (HBase, Aerospike, Cassandra)
2.  Streaming jobs, e.g. Spark Streaming (micro-batch) or Storm
3.  MapReduce run in micro-batches to insert into the NoSQL database
4.  Kafka as a distributed queue

The end-to-end flow:

Data -> streaming framework -> NoSQL database
OR
Data -> Kafka -> streaming framework -> NoSQL database
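To sketch the second flow, a minimal consumer loop with the kafka-python client might look roughly like this. The topic name, broker address, and the on_new_event handler (the logic from the question's pseudocode) are assumptions, not part of this answer:

from kafka import KafkaConsumer  # pip install kafka-python

# Assumed topic and broker address; adjust to your setup.
consumer = KafkaConsumer("pullover-events", bootstrap_servers="localhost:9092")

for message in consumer:
    # Each message value is assumed to be one CSV line, e.g. b"TimeStamp01,Peter,Green".
    date, name, color = message.value.decode("utf-8").split(",")
    on_new_event(date.strip(), name.strip(), color.strip())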


In the NoSQL database there are two ways to model your data:

1. Key by "Name" and, for every event for that key, insert a row into the database.
   When fetching, you get back all events corresponding to that key.

2. Key by "Name"; every time an event arrives for that key, do an UPSERT into an existing blob (an object saved as binary). Inside the blob you maintain the time ranges and colors seen (see the sketch below).
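A rough illustration of the second modelling option, with an ordinary dict standing in for the key-value store. The blob layout is an assumption; a real store like Aerospike or HBase would hold the serialized bytes under the key:

import json

store = {}  # name -> serialized blob of intervals

def upsert(name, date, color):
    # Read-modify-write of the per-person blob (the UPSERT in the real store).
    intervals = json.loads(store.get(name, "[]"))
    if intervals and intervals[-1]["color"] == color:
        intervals[-1]["end"] = date  # extend the open interval
    else:
        intervals.append({"start": date, "end": date, "color": color})
    store[name] = json.dumps(intervals)

upsert("Peter", "TimeStamp01", "Green")
upsert("Peter", "TimeStamp03", "Green")
upsert("Peter", "TimeStamp04", "Red")
print(store["Peter"])
# [{"start": "TimeStamp01", "end": "TimeStamp03", "color": "Green"},
#  {"start": "TimeStamp04", "end": "TimeStamp04", "color": "Red"}]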

Code samples for reading from and writing to HBase and Aerospike:

HBase: http://bytepadding.com/hbase/

Aerospike: http://bytepadding.com/aerospike/

KrazyGautam

One way to do it is to use HiveMQ. HiveMQ is an MQTT-based message queue technology. The nice thing about it is that you can write custom plugins to process incoming messages. To get the latest entry of an event for a person, a hash table in the HiveMQ plugin would work fine. If the number of different persons is very large, I would consider using a cache like Redis to hold the latest event for each person.

Your service publishes events to HiveMQ. The HiveMQ plugin processes each incoming event and makes updates to your database.
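HiveMQ plugins themselves are written in Java; purely to illustrate the caching idea, here is a Python sketch with the redis-py client. The key layout and the write_interval_row helper are assumptions for the example:

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def process_event(date, name, color):
    cached = r.hgetall(f"latest:{name}")  # empty dict if no entry yet
    if cached and cached[b"color"] == color.encode():
        # Same color: only the end of the current interval moves forward.
        r.hset(f"latest:{name}", "end", date)
    else:
        # Color changed (or first event): flush the finished interval, restart the cache.
        if cached:
            write_interval_row(cached)  # assumed helper: persists one interval to the database
        r.hset(f"latest:{name}", mapping={"start": date, "end": date, "color": color})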

HiveMQ Plugin

Redis

DXM