Email deduplication

Question

is it true that e-mail can be deduplicated by just using some of their headers as according to RFC their message-id should be unique?

Is there any way to calculate the chance of 1 single email beeing missed in this deduplication method below (sha512 hash of those 3 headers)?

// $email is a parsed array containing 3 keys (mime headers) -> message_id, subject and date. $hashStr = $email['message_id']; $hashStr .= $email['subject']; $hashStr .= $email['date']; $uniqueEmailId = hash('sha512', $hashStr);

It is kind of mission critical that no single email will be missed, chances are that we are having to deduplicate over several (>2) billion mime files.

score 4 · Answer 1 · answered Apr 15 '14 at 18:38

The SHA512 hash produces a hash value with 512 bits of data. Assuming a random distribution of bits, this works out to more than 1.34e+154 possible values. Even with over 2e+9 samples, the chances of an accidental collision are very near zero.

However, your input for the hash isn't quite that random. message_id is a globally unique identifier which "only" has 5.3e+36 possible values, and the randomness depends on the implementation. According to the wiki link, the odds of a collision are about 50% at 4.2e+18 samples. Email addresses and dates are likely significantly higher than that.

That said, without actually doing the probability math, I would say that the odds are negligible.

Message-ID isn't a GUID in that sense. It's globally unique, but constructed in an implementation-specific manner. Usual technique is to combine a hex timestamp_seq# on the left with the host name on the right of the @ sign. See RFC 2822 pp22-24 — Paul J. Ste. Marie, Aug 18 '16 at 19:13

score 2 · Answer 2 · answered Apr 21 '14 at 21:46

2

If the message-id is already unique, there's little point in hashing (and therefore introducing the admittedly negligible chance of collision).
It seems that a more robust approach would be to use the message-id itself as a basis for comparison.

answered Apr 21 '14 at 21:46

fstd

574
2
7

Wouldn't this combination of messageid, subject and date make the hash collision chance lower? – Floris Apr 23 '14 at 09:26
@Floris no, it would introduce the chance of collision in the first place. I'm not talking about hashing the msgid, but taking it literally, as it already is supposed to be a unique identifier – fstd Apr 23 '14 at 10:04
I know, but I have to handle (very big) amounts of email data (over 10k accounts with 20k messages each), so I need 1 key for an unique message. Unfortunately in my tests it appeared that the Message-ID header isen't really "unique". – Floris Apr 23 '14 at 12:08
1

@Flori uh, so you have non-duplicate email with literally the same Message-ID header? Mind sharing the User-Agent or X-mailer headers of that mail (just so I can stay clear of those programs)? Thanks. Regarding your issue, I guess your proposal of using Message-ID, Subject and Date would be a reasonable thing to do for non-unique Message-IDs, however, it probably does not matter much if you hash it or not. To be /very/ sure, you might also consider the mail body (though then I'd in fact do hash it) – fstd Apr 23 '14 at 13:20

score 0 · Answer 3 · edited Jan 30 '22 at 15:09

Even when RFC says that Message-ID MUST be a globally unique, it is still the sender who generates these headers.

You cannot trust that sender is compliant with RFC (if sender software is not controlled by you).

Primitive sender maybe does not want to generate globally unique Message-ID, or he is sending copies to different recipients using template with identical Message-ID.

I don't know about any email server which verifies uniqueness of Message-ID headers.

If you want to deduplicate email, best solution is to calculate cryptographicaly secure hash of its whole content with enough bits. IMHO 160bits is enough for 2 billion.

Email deduplication

3 Answers3