Our webapp has a feature that allows users to import data by sending emails to a specific email address. When the emails are received by our app, they are processed differently depending on who sent them. We look at the "sender" field of the email, and match it to a user in our database. Once the user who sent the email has been determined, we handle that email based on that user's personal settings.
This has generally been working fine for most users. However, certain users complained that their emails weren't getting processed. When we looked into it, we found that their email server was adding information to the sender's email address, so the address no longer matched what was in our User table in the database. For example, the user's email might be testuser@example.com in the database, but the "sender" field in the email we received would be something like btv1==502867923ab==testuser@example.com. Some research suggested this was caused by Bounce Address Tag Validation (BATV) being used by the sender's server.
We need to be able to extract the canonical email address from the "sender" field provided to us, so we can match it to our user table. One of the other developers here wrote a function to do this, and submitted it to me for code review. This is what he wrote (C#):
    private static string SanitizeEmailSender(string sender)
    {
        if (sender == null)
            return null;
        return System.Text.RegularExpressions.Regex.Replace(
            sender,
            @"^((btv1==.{11}==)|(prvs=.{9}=))",
            "",
            System.Text.RegularExpressions.RegexOptions.None);
    }
The regex pattern here covers the specific cases we've seen in our email logs. My concern is that the regex might be too specific. Are btv1 and prvs the only prefixes used in these tags? Are there always exactly 9 characters after prvs=? Are there sender tagging schemes other than BATV that we need to look out for? What I don't want is to put this fix in production only to find out next month that we need to fix it again because of cases we didn't consider.
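For comparison, here is a rough sketch of a looser version of the same pattern. It keeps the two prefixes from our logs but no longer fixes the tag length. The class and method names (SenderTags, StripKnownTag) are made up for illustration, and the assumption that the tag itself never contains = comes only from the samples we've seen, not from anything I've confirmed in a spec.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SenderTags
{
    // Assumption: the tag is a run of one or more non-'=' characters,
    // instead of exactly 11 (btv1) or 9 (prvs) characters.
    private static readonly Regex TagPrefix = new Regex(
        @"^(?:btv1==[^=]+==|prvs=[^=]+=)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the sender with a leading btv1/prvs tag removed, or the
    // input unchanged when no such prefix is present.
    public static string StripKnownTag(string sender)
    {
        if (sender == null)
            return null;
        return TagPrefix.Replace(sender, "");
    }
}
```

This would tolerate tags of other lengths, but it is also more likely to mangle a legitimate address that happens to start with prvs= or btv1==, so it only trades one risk for another.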
My gut instinct was to just trim the email address to only include the part after the last =. However, research suggests that = is a valid character in email addresses and thus may be part of the user's canonical email address. I personally have never seen = used in an email address outside some kind of tagging or sub-addressing scheme, but you never know. Murphy's law suggests that the minute I assume a user will never have a certain character in their email address, somebody with that sort of address will immediately sign up.
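One way to hedge against that, sketched below, is to avoid guessing the tag format at all: try the address exactly as received first, then try each remainder that follows an = in the local part, and accept a candidate only if it matches a known user. Everything here is hypothetical (SenderMatcher, Candidates, FindKnownAddress, and the isKnownUser callback standing in for a lookup against our User table), and I haven't tested it against real traffic.

```csharp
using System;
using System.Collections.Generic;

public static class SenderMatcher
{
    // Yields the sender as received, then every suffix that starts right
    // after an '=' in the local part, from longest to shortest.
    public static IEnumerable<string> Candidates(string sender)
    {
        if (string.IsNullOrEmpty(sender))
            yield break;

        yield return sender;

        int at = sender.LastIndexOf('@');
        if (at < 0)
            yield break;

        for (int i = sender.IndexOf('='); i >= 0 && i < at; i = sender.IndexOf('=', i + 1))
        {
            // Skip an '=' that sits directly before the '@' (empty local part).
            if (i + 1 < at)
                yield return sender.Substring(i + 1);
        }
    }

    // Returns the first candidate that isKnownUser accepts, or null.
    // isKnownUser is a placeholder for the real database lookup.
    public static string FindKnownAddress(string sender, Func<string, bool> isKnownUser)
    {
        foreach (string candidate in Candidates(sender))
        {
            if (isKnownUser(candidate))
                return candidate;
        }
        return null;
    }
}
```

Because the full address is tried first, a user who really is registered as a=b@example.com still matches. The remaining risk is ambiguity when both a=b@example.com and b@example.com are registered, and it costs an extra lookup per = in the local part.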
My question is: is there an industry-accepted, reliable way to extract a user's canonical email address from a longer address that may include BATV or other tags? Failing that, is there at least a more reliable way than what we've got so far? Or is what we've got actually sufficient?