Edit: some people flagged this question as a potential duplicate of this other one. While I agree that knowing how the birthday paradox applies to hashing functions, the 2 questions (and respective answers) address 2 different, albeit related, subjects. The other question is asking "what are the odds of collision", whereas this question main focus is "how can I make sure that collision never happens".
I have a data lake stored in S3 where each day an ETL script dumps additional data from the day before.
Due to how the pipeline is built, it is possible for a very inconsiderate user that has admin access to produce duplicates in said data lake by manually interacting with the dump files coming from our OLTP database, and triggering the ETL script when it's not supposed to.
I thought that a good idea to prevent data duplication was to insert a form of security measure in my ETL script:
- Produce a hash for each entry.
- Store said hashes somewhere else (like a dynamodb table).
- Whenever new data comes in, hash that as well and compare it with the already existing hashes.
- If any of new hash is in the existing hashes, reject the associated entry entirely.
However, I know very little about hashing and I was reading that, although unlikely, 2 different sources can produce the same hash.
I understand it's really hard for it to happen in this situation, but I was wondering if there is a way to be 100% sure about it.
Any idea is much appreciated.