Is it true that emails can be deduplicated using just a few of their headers, given that according to the RFC their Message-ID should be unique?
Is there any way to calculate the chance of a single email being missed by the deduplication method below (a SHA-512 hash of three headers)?
// $email is a parsed array containing 3 keys (mime headers) -> message_id, subject and date.
$hashStr = $email['message_id'];
$hashStr .= $email['subject'];
$hashStr .= $email['date'];
$uniqueEmailId = hash('sha512', $hashStr);
It is mission critical that not a single email is missed, and we will likely have to deduplicate several billion (more than 2 billion) MIME files.
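To make the question concrete, here is a rough sketch of the only part I can estimate myself: the chance of a pure hash collision among 2 billion inputs, using the birthday-bound approximation p ≈ n² / 2^(bits + 1). It assumes SHA-512 output is uniformly distributed, and the function name `log2CollisionProbability` is just a placeholder of mine. It says nothing about the other failure mode, where two genuinely different emails carry the same Message-ID, subject and date.

```php
<?php
// Birthday-bound approximation for n items hashed into a space of 2^bits values:
//   p ≈ n^2 / 2^(bits + 1)
// The result is returned as log2(p), because p itself is too small for a float.
function log2CollisionProbability(float $n, int $bits): float
{
    return 2 * log($n, 2) - ($bits + 1);
}

// Assumed scale: 2 billion MIME files, 512-bit digest.
$log2p = log2CollisionProbability(2e9, 512);
printf("hash collision probability is roughly 2^%.1f\n", $log2p);
```

With these numbers the exponent comes out at about -451, so the hash itself does not look like the weak point; what I cannot quantify is how often the three headers fail to identify an email uniquely.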