3

Edit: some people flagged this question as a potential duplicate of this other one. While I agree that knowing how the birthday paradox applies to hash functions is relevant here, the two questions (and their respective answers) address two different, albeit related, subjects. The other question asks "what are the odds of a collision?", whereas this question's main focus is "how can I make sure a collision never happens?".

I have a data lake stored in S3 where each day an ETL script dumps additional data from the day before.

Due to how the pipeline is built, it is possible for a very inconsiderate user with admin access to produce duplicates in said data lake by manually interacting with the dump files coming from our OLTP database and triggering the ETL script when it's not supposed to run.

I thought that a good way to prevent data duplication was to build a safeguard into my ETL script (a minimal sketch follows the list):

  • Produce a hash for each entry.
  • Store said hashes somewhere else (like a dynamodb table).
  • Whenever new data comes in, hash that as well and compare it with the already existing hashes.
  • If any new hash matches an existing one, reject the associated entry entirely.
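Concretely, here is a minimal sketch of that flow in Python with boto3. The table name `etl-entry-hashes` and its string partition key `entry_hash` are placeholders I made up; the conditional put makes the check-and-reject step atomic, so two concurrent runs can't both insert the same entry:

```python
import hashlib
import json

import boto3
from botocore.exceptions import ClientError

# Placeholder table: its partition key is a string attribute "entry_hash".
dynamodb = boto3.resource("dynamodb")
hash_table = dynamodb.Table("etl-entry-hashes")


def entry_hash(entry: dict) -> str:
    """Hash an entry deterministically: serializing with sorted keys means
    the same logical record always yields the same digest."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def insert_if_new(entry: dict) -> bool:
    """Record the hash atomically; return False (reject the entry) if the
    hash is already present. The conditional write avoids the race of a
    separate get-then-put between concurrent ETL runs."""
    try:
        hash_table.put_item(
            Item={"entry_hash": entry_hash(entry)},
            ConditionExpression="attribute_not_exists(entry_hash)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```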

However, I know very little about hashing, and I was reading that, although it's unlikely, two different inputs can produce the same hash.

I understand it's extremely unlikely to happen in this situation, but I was wondering whether there is a way to be 100% sure about it.

Any idea is much appreciated.

wtfzambo
  • "*I don't know if that could ever happen*" - Mathematically speaking, it's almost always a possibility (except with something like [Perfect hash functions](https://en.wikipedia.org/wiki/Perfect_hash_function)). The likelihood is largely a function of the number of items you're hashing and the algorithm you choose. – esqew Nov 19 '20 at 15:54
  • Possible duplicate of [Probability of hash collision](https://stackoverflow.com/questions/62664761/probability-of-hash-collision). TL;DR - under a ridiculous number of records, it's extremely unlikely you would encounter a collision using any widely-available hashing algorithm. – esqew Nov 19 '20 at 15:56
  • @esqew even though the other question you linked is very informative, I was aware of the birthday paradox as applied to hash functions, but it's not exactly what I'm asking. The question is not "what are the chances..." but "how to be 100% sure to avoid..." – wtfzambo Nov 19 '20 at 17:00
  • *how can I make sure that collision never happens?* You will have to accept a non-zero probability of a collision. SHA-256 is more than enough. If you do find a collision, that collision is probably worth more than the data. – President James K. Polk Nov 19 '20 at 19:21

1 Answer

2

Long answer: what you want to study and explore is called "perfect hashing", i.e. hashing guaranteed not to have collisions: https://en.wikipedia.org/wiki/Perfect_hash_function. Note, though, that a perfect hash function can only be built for a set of keys that is known in advance, which rarely fits a pipeline ingesting new data every day.

Short answer: a cryptographic, collision-resistant algorithm like SHA-1 is probably safe to use for all but the largest (PBs a day) datasets, and even then it's probably all right. Git uses SHA-1 internally, and code repositories probably deal with the most files on the planet yet rarely see collisions. See for details: https://ericsink.com/vcbe/html/cryptographic_hashes.html#:~:text=Git%20uses%20hashes%20in%20two,computed%20when%20it%20was%20stored.
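To put "probably safe" into numbers: the birthday bound says that n random b-bit hashes collide with probability of roughly n²/2^(b+1). A quick back-of-the-envelope check (the workload figures below are made up for illustration):

```python
from math import log2

# Birthday bound: n random b-bit hashes collide with probability
# of roughly n^2 / 2^(b+1).  Made-up workload: one billion entries
# per day, every day, for 100 years.
n = 10**9 * 365 * 100   # ~3.65e13 hashes
b = 160                 # SHA-1 digest size in bits

p = n * n / 2 ** (b + 1)
print(f"collision probability ≈ 2^{log2(p):.0f}")  # ≈ 2^-71
```

That is so far below the probability of, say, silent data corruption in the storage layer that the hash is not the part of the pipeline worth worrying about.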

Medium answer: this is actually a pretty hard problem overall, a frequent area of study in computer science, and a lot depends on your particular use case and the context you're operating in. Cuckoo hashing, collision-resistant algorithms, and hashing in general are all good terms to research. There's also a lot of art and science behind the space (memory) and time (compute) trade-offs involved in picking these methods. A good rule of thumb is that perfect hashing will generally take more space and time than a collision-resistant cryptographic hash like SHA-1.

daidoji70