How can I extract the canonical email address given an address that includes BATV or other tags?

Question

Our webapp has a feature that allows users to import data by sending emails to a specific email address. When the emails are received by our app, they are processed differently depending on who sent them. We look at the "sender" field of the email, and match it to a user in our database. Once the user who sent the email has been determined, we handle that email based on that user's personal settings.

This has generally been working fine for most users. However, certain users were complaining that their emails weren't getting processed. When we looked into it, we found that their email server was adding information to the sender's email address, and this caused the email address not to match what was in our User table in the database. For example, the user's email might be testuser@example.com in the database, but the "sender" field in the email we received would be something like btv1==502867923ab==testuser@example.com. Some research suggested this was caused by Bounce Address Tag Validation (BATV) being used by the sender's server.

We need to be able to extract the canonical email address from the "sender" field provided to us, so we can match it to our user table. One of the other developers here wrote a function to do this, and submitted it to me for code review. This is what he wrote (C#):

private static string SanitizeEmailSender(string sender)
{
    if (sender == null)
        return null;
    return System.Text.RegularExpressions.Regex.Replace(
        sender, 
        @"^((btv1==.{11}==)|(prvs=.{9}=))", 
        "", 
        System.Text.RegularExpressions.RegexOptions.None);
}

The regex pattern here covers the specific cases we've seen in our email logs. My concern is that the regex might be too specific. Are btv1 and prvs the only prefixes used in these tags? Are there always exactly 9 characters after prvs=? Are there other email sender tagging schemes other than BATV that we need to look out for? What I don't want is to put this fix in production just to find out next month that we need to fix it again because there were other cases we didn't consider.

My gut instinct was to just trim the email address to only include the part after the last =. However, research suggests that = is a valid character in email addresses and thus may be part of the user's canonical email address. I personally have never seen = used in an email address outside some kind of tagging or sub-addressing scheme, but you never know. Murphy's law suggests that the minute I assume a user will never have a certain character in their email address, somebody with that sort of address will immediately sign up.
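
To make that concern concrete, here is a minimal sketch of the "trim at the last =" idea. The class and method names (LastEqualsDemo, TrimToLastEquals) are hypothetical and only for illustration; this is not code we have in production.

```csharp
using System;

public static class LastEqualsDemo
{
    // Hypothetical helper: keeps only the text after the last '='.
    // If the input contains no '=', it is returned unchanged.
    public static string TrimToLastEquals(string sender)
    {
        if (sender == null)
            return null;
        int lastEquals = sender.LastIndexOf('=');
        return lastEquals < 0 ? sender : sender.Substring(lastEquals + 1);
    }
}
```

For btv1==502867923ab==testuser@example.com this yields testuser@example.com, which is what we want. For an untagged address such as first=last@example.com it yields last@example.com, which is the wrong user.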

My question is: is there an industry-accepted reliable way to extract a user's canonical email address given a longer address that may include BATV or other tags? Failing that, is there at least a more reliable way than what we've got so far? Or is what we've got actually sufficient?

score 1 · Answer 1 · answered Apr 10 '15 at 05:29

As the information added by BATV is always preceded by the BATV tag and delimited by two == strings, this is what I would use:

((btv1|prvs)==([^=]|=[^=])*==)

Of course, you are right that an = sign is a valid character in an email address, but that's precisely the reason the scheme uses that sequence (so that the result is still a valid email address).

If you dig a little more into the RFCs relating to email, you'll see that MIME adds some constructs to allow non-ASCII characters in an email address by way of the quoted-printable feature. A little RFC reading is needed to decide how to cope correctly with these things.

Finally, to answer your question: mail servers are authorised to modify or rewrite the envelope addresses (these are the addresses used in the SMTP control protocol for routing mail messages), and sendmail can do it even in the mail header fields. So the right answer to your question is that there's no reliable way (industry-accepted or not) to extract the sender's canonical email address. Addresses are rewritten as the message progresses to the target recipient, and information is lost along the way. You cannot recover the original address used.

And last, to illustrate a little:

The Sender field is added by the final SMTP recipient to include in the email the address of the envelope sender (the address used as MAIL FROM:<sender@address.com> in the original SMTP conversation). In the message headers this address is normally recorded as Return-Path.
The From field is added by the original mail client to identify the origin of the message. This behaviour can be modified by the presence of Resent-From or Resent-Sender fields, which identify a message that has been resent.
Finally, the sender can use a Reply-To header to indicate that responses should be sent to that address.

To get an idea of how the SMTP protocol works, read the dense RFC 2821 (SMTP protocol) and RFC 2822 (format of Internet mail messages) documents, since obsoleted by RFC 5321 and RFC 5322 respectively.

score 1 · Answer 2 · edited Oct 07 '21 at 07:27

Are btv1 and prvs the only prefixes used in these tags?

prvs is a prefix that conforms to the "meta-syntax" defined in the BATV draft specification. btv1 is a Barracuda appliance Invalid Spoof Suppression rewrite which doesn't follow the BATV standard (hence the double equal sign).

A regex that just matches all BATV local-parts would be

[0-9A-Za-z\-]+=[0-9A-Za-z\-]+=.+@.+

But this wouldn't catch the Barracuda btv1 rewrites (and other rewrites)
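
As a rough illustration, here is how that pattern might be applied in C#. The ^ and $ anchors and the named capture group are my additions so the original address can be pulled out; the class and method names (BatvLocalPart, Strip) are hypothetical. Treat it as a sketch rather than a complete solution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BatvLocalPart
{
    // The generic BATV meta-syntax: <tag-type>=<tag-val>=<loc-core>@<domain>.
    // Anchors and the "address" group are assumptions added for extraction.
    private static readonly Regex Batv = new Regex(
        @"^[0-9A-Za-z\-]+=[0-9A-Za-z\-]+=(?<address>.+@.+)$");

    // Returns the address without the BATV tag, or the input unchanged
    // when it does not match the meta-syntax.
    public static string Strip(string sender)
    {
        if (sender == null)
            return null;
        Match match = Batv.Match(sender);
        return match.Success ? match.Groups["address"].Value : sender;
    }
}
```

With this sketch, prvs=0123456789=testuser@example.com becomes testuser@example.com, while btv1==502867923ab==testuser@example.com is returned unchanged because nothing sits between the two = signs.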

Are there always exactly 9 characters after prvs=?

No. The spec says there are 10, but in the wild it's most often 9.

Are there other email sender tagging schemes other than BATV that we need to look out for?

Yes, see below.

is there an industry-accepted reliable way to extract a user's canonical email address given a longer address that may include BATV or other tags?

No

By looking at various code bases it looks like everybody implements their own solution. Some of the complexity comes from the fact that there are:

- the BATV rewrites
- BATV rewrites which try but fail to follow the standard by swapping the loc-core and tag-val positions. Here is an example showing these reversed versions and some code which validates each to see if it's a prvs value and then assumes the other one is the loc-core
- the Barracuda non-standard rewrites
- other non-BATV rewrites like
  - SRS
  - Google Forwards

Here's a unit test containing a list of possible sender rewritten examples and here are some examples of syntaxes found in the wild.

Failing that, is there at least a more reliable way than what we've got so far? Or is what we've got actually sufficient?

It looks like the best approach is to address each of the conditions in the way that ezmlm-idx and rspamd do.

The regex you're using won't cover:

- prvs with loc-core and tag-val reversed
- prvs that follow the spec with 10 characters instead of 9
- SRS
- Google forwards
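
A minimal sketch of the "handle each condition separately" idea follows. It is not taken from ezmlm-idx or rspamd. The names (SenderNormalizer, Normalize) are hypothetical, and the patterns rest on assumptions: a prvs tag-val of 9 or 10 hex characters, a btv1 tag with no = inside it, and the common SRS0=hash=timestamp=domain=local layout. Google forwards and SRS1 are deliberately left out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SenderNormalizer
{
    private const RegexOptions Options = RegexOptions.IgnoreCase;

    // Tried in order; each pattern exposes "local" and "domain" groups.
    private static readonly Regex[] Patterns =
    {
        // Standard BATV: prvs=<tag-val>=<loc-core>@<domain>
        new Regex(@"^prvs=[0-9A-F]{9,10}=(?<local>.+)@(?<domain>[^@]+)$", Options),
        // Reversed BATV: prvs=<loc-core>=<tag-val>@<domain>
        new Regex(@"^prvs=(?<local>.+)=[0-9A-F]{9,10}@(?<domain>[^@]+)$", Options),
        // Barracuda: btv1==<tag>==<local>@<domain>
        new Regex(@"^btv1==[^=]+==(?<local>.+)@(?<domain>[^@]+)$", Options),
        // SRS0: SRS0=<hash>=<timestamp>=<orig-domain>=<orig-local>@<forwarder>
        new Regex(@"^SRS0[=+-][^=]+=[^=]+=(?<domain>[^=]+)=(?<local>.+)@[^@]+$", Options),
    };

    // Returns the reconstructed original address, or the input unchanged
    // when no known rewrite scheme matches.
    public static string Normalize(string sender)
    {
        if (string.IsNullOrEmpty(sender))
            return sender;

        foreach (Regex pattern in Patterns)
        {
            Match match = pattern.Match(sender);
            if (match.Success)
                return match.Groups["local"].Value + "@" + match.Groups["domain"].Value;
        }
        return sender;
    }
}
```

Addresses that match none of the patterns, including ordinary ones that happen to contain =, pass through untouched, so a new rewrite scheme shows up as a failed user lookup rather than a wrong match. One ambiguity remains: a reversed prvs address whose local part is itself 9 or 10 hex characters would be read as the standard form.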
