We have an external service that continuously sends us data. For the sake of simplicity, let's say each record consists of three tab-delimited strings:
datapointA datapointB datapointC
This data is received by one of our servers and then forwarded to a processing engine, which does something meaningful with the dataset.
One requirement is that the processing engine must not process duplicate records. So for instance, on day 1 the processing engine received
A B C
and on day 243 the same A B C was received again. In this situation, the processing engine should emit a warning, "record already processed", and skip that particular record.
There may be a few ways to solve this issue:
Store the incoming data in an in-memory HashSet; membership in the set indicates whether a particular record has already been processed. Problems arise when the service must run with zero downtime: depending on the surge of data, the collection can exceed the bounds of memory. Also, in case of a system outage, this data needs to be persisted someplace.
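The in-memory approach above can be sketched as follows (class and method names are illustrative, and this sketch deliberately ignores the persistence and memory-bound problems just described):

```python
class InMemoryDeduper:
    """Track processed records in a set; membership means 'already processed'."""

    def __init__(self):
        self._seen = set()

    def should_process(self, record: str) -> bool:
        """Return True and remember the record if new; False if a duplicate."""
        if record in self._seen:
            print("record already processed")  # the warning from the question
            return False
        self._seen.add(record)
        return True
```

For example, the first call to `should_process("A\tB\tC")` returns True, and every later call with the same record returns False. Note `self._seen` grows without bound for the lifetime of the process, which is exactly the memory concern raised above.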
Store the incoming data in a database, and process the next set of data only if it is not already present there. This helps with the durability of the history in case of some catastrophe, but there is the overhead of maintaining proper indexes, and of aggressive sharding if performance becomes an issue.
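One common way to implement the database approach is to let a unique index do the duplicate check atomically, rather than doing a separate SELECT-then-INSERT. A minimal sketch using SQLite (the table and function names are my own assumptions; any database with unique constraints works the same way):

```python
import sqlite3

def make_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store; the PRIMARY KEY gives us a unique index for free."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (record TEXT PRIMARY KEY)")
    return conn

def should_process(conn: sqlite3.Connection, record: str) -> bool:
    """Atomically claim a record; return False if it was already stored.

    INSERT OR IGNORE inserts the row only if no row with that key exists,
    so the check and the write are one statement (no race between them).
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed(record) VALUES (?)", (record,)
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted means the record is new
```

The point of the single-statement insert is that two concurrent workers cannot both decide a record is new; the unique index arbitrates for them.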
...or some other technique.
Can somebody point me to case studies or established patterns and practices for solving this particular problem?
Thanks