So I have a string that I need to split by semicolons.

Email address: "one@tw;,.'o"@hotmail.com;"some;thing"@example.com

Both of the email addresses are valid

So I want to have a List<string> of the following:

  • "one@tw;,.'o"@hotmail.com
  • "some;thing"@example.com

But the way I am currently splitting the addresses is not working:

var addresses = emailAddressString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(x => x.Trim()).ToList();

Because of the multiple ; characters I end up with invalid email addresses.

I have tried a few different approaches, including checking whether the string contains quotes and then locating the ; characters by index and working it out from there, but it's a real pain.

Does anyone have any better suggestions?

Sergey Kalinichenko
Jamie Rees
  • My suggestion would be to make sure that your delimiter character does not show up anywhere else other than to mark the boundary between emails, so emails with `;` as part of their name (e.g. "some;thing@example.com") should not be allowed. Otherwise, find a different delimiter character, like a pipe `|`? – code_dredd Nov 11 '15 at 13:36
  • RegEx to the rescue? Maybe you can adapt: http://stackoverflow.com/questions/7430186/regex-split-string-with-on-a-delimetersemi-colon-except-those-that-appear-in – Corak Nov 11 '15 at 13:38
  • Try the following: `(^|;)(.*?)@([\d\w]+[-]*)+\.\w+` – Camo Nov 11 '15 at 13:39
  • @SonerGönül close but it has removed the quotes from the 2nd address. – Jamie Rees Nov 11 '15 at 13:40
  • @ray unfortunately I need to use a `;` character and it is valid inside an email address – Jamie Rees Nov 11 '15 at 13:41
  • @JamieR: Can you comment on why you'd want to treat a valid character for the email also as a delimiter simultaneously? This seems to be an ambiguous case. – code_dredd Nov 11 '15 at 13:43
  • @ray Business requirement. – Jamie Rees Nov 11 '15 at 13:44
  • @JamieR: I get it's required for you, but it's not too illuminating, I'm afraid. At least for me, it does not clarify the ambiguity of giving a double-meaning to the same char, imho. It seems no different than choosing any other letter in the alphabet, like `a` or `d`. – code_dredd Nov 11 '15 at 13:47
  • @ray, It's completely reasonable to have delimited data where the delimiter can be in the data if the delimiter is somehow escaped, either by a preceding escape character or by enclosing it between other special characters (in this case double quotes). – juharr Nov 11 '15 at 13:49
  • @juharr: I agree with that, but normally outside of email context (e.g. csv files). This is at least the *first* time I've ever seen this kind of mess in an email address setting, especially since I thought it'd be pretty straightforward to follow "normal" practices regarding email address formats. E.g. some email clients (e.g. Outlook) use `;` to delimit addresses when you're adding recipients, so I probably wouldn't be able to send emails to such "valid" addresses if they were to show up. – code_dredd Nov 11 '15 at 13:54
  • @ray https://en.wikipedia.org/wiki/Email_address#Examples – Jamie Rees Nov 11 '15 at 14:01
  • @ray That's why the part before the @ is in quotes. To tell it that the semicolons inside are not meant as delimiters. – juharr Nov 11 '15 at 14:18
  • @ray The right reaction to someone trying to follow a clearly defined standard isn't "just ignore it to make your life easier!". I mean sure tons of people do (thanks), but that just means that I can't for example use `+` when specifying my gmail address to easily separate mail I receive from different sources most of the time because apparently nobody thought that + might be a valid character in an email (and don't even try more exotic things). – Voo Nov 11 '15 at 18:55
  • @Voo: Mind explaining where you get the idea that I recommend "ignoring" things to "make life easier"? Because I never did. I had simply, by experience, not encountered the situation where characters that I normally see used as delimiters would suddenly show up as part of a valid email and I've certainly never seen, sent, or received emails from such addresses --until this post and the wiki link by Jamie. That being said, there're some pretty brain-dead standards out there, too, and they deserve to get ignored. – code_dredd Nov 11 '15 at 19:06
  • @ray Well you did say "emails with ; as part of their name [..] should not be allowed." - not "emails with ; aren't valid so reject them". This is a sore topic since it's one of those topics where people who just don't bother reading the standard before implementing something make my life more difficult. Now while some things are possibly not that useful, making that decision should be carefully thought through since there are millions of use cases and it's really hard (impossible I'd say) to consider all of them. I'm sure someone else also thought "who'd ever have a + in their email address?" – Voo Nov 11 '15 at 19:40
  • @Voo: It was mentioned, though the context of the comment was a question. In any case, the OP said it was a 'business requirement' and regardless of what the standard actually has written in it, this would be the first IT Dept that *I* see has chosen to allow those chars instead of the usual `first.last@domain.com` or something along those lines. The fact that I mentioned my personal experience and first hand observation is unrelated to the text of the standard, imho. – code_dredd Nov 11 '15 at 21:25

3 Answers

Assuming that double-quotes are not allowed, except for the opening and closing quotes ahead of the "at" sign @, you can use this regular expression to capture e-mail addresses:

((?:[^@"]+|"[^"]*")@[^;]+)(?:;|$)

The idea is to capture either an unquoted [^@"]+ or a quoted "[^"]*" part prior to @, and then capture everything up to semicolon ; or the end anchor $.

Demo of the regex.

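using System;
using System.Text.RegularExpressions;

var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world";
// Group 1 holds the address without the trailing delimiter.
var mm = Regex.Matches(input, "((?:[^@\"]+|\"[^\"]*\")@[^;]+)(?:;|$)");
foreach (Match m in mm) {
    Console.WriteLine(m.Groups[1].Value);
}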

This code prints

"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world

Demo 1.

If you would like to allow escaped double-quotes inside double-quotes, you could use a more complex expression:

((?:(?:[^@\"]|(?<=\\)\")+|\"([^\"]|(?<=\\)\")*\")@[^;]+)(?:;|$)

Everything else remains the same.

Demo 2.
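
As a rough sketch of how the second expression might be used from C#, the snippet below wraps it in a helper. The class and method names (`EmailSplitter`, `SplitAddresses`) are hypothetical and not part of the original answer. The pattern is written as a verbatim string, an assumption made here so that the backslashes reach the regex engine unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailSplitter
{
    // Verbatim string: "" is a literal double quote, \\ is a regex-escaped backslash.
    private const string Pattern =
        @"((?:(?:[^@""]|(?<=\\)"")+|""([^""]|(?<=\\)"")*"")@[^;]+)(?:;|$)";

    // Hypothetical helper: collects group 1 (the whole address) of every match.
    public static List<string> SplitAddresses(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

For the input `"a\"b;c"@example.com;hello@world`, this should yield `"a\"b;c"@example.com` and `hello@world`. The escaped quote is copied to the output unchanged, so any un-escaping still has to happen in the calling code.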

Sergey Kalinichenko
  • Thank you for your help. Been pulling my hair out about this one! – Jamie Rees Nov 11 '15 at 13:55
  • What if double-quotes are allowed? – Jamie Rees Nov 11 '15 at 14:09
  • @JamieR It depends on the rules of allowing extra double-quotes. If extra double-quotes are allowed inside quoted strings, but they must be escaped, then this part `"[^"]*"` of the regex would become a lot trickier, but still workable. Allowing unrestricted double-quotes everywhere would be ambiguous. – Sergey Kalinichenko Nov 11 '15 at 14:13
  • @JamieR [Here is a demo](http://ideone.com/XUTcno) of an expression that allows escaped quotes inside or outside quoted strings. Note that you need to un-escape these quotes in your code, because they are transferred to the output unchanged. – Sergey Kalinichenko Nov 11 '15 at 14:19

I obviously started writing my non-regex method at around the same time as juharr (see the other answer). Since I already have it written, I thought I would submit it.

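    public static IEnumerable<string> SplitEmailsByDelimiter(string input, char delimiter)
    {
        var startIndex = 0;
        var delimiterIndex = 0;

        while (delimiterIndex >= 0)
        {
            // Use the delimiter argument rather than a hard-coded ';'.
            delimiterIndex = input.IndexOf(delimiter, startIndex);
            string substring = input;
            if (delimiterIndex >= 0)
            {
                substring = input.Substring(0, delimiterIndex);
            }

            // A candidate is complete when it has no quotes, or an opening and a closing quote.
            if (!substring.Contains("\"") || substring.IndexOf("\"") != substring.LastIndexOf("\""))
            {
                // Skip empty entries, such as the one after a trailing delimiter.
                if (substring.Length > 0)
                {
                    yield return substring;
                }
                input = input.Substring(delimiterIndex + 1);
                startIndex = 0;
            }
            else
            {
                // The delimiter sits inside an open quote, so look for the next one.
                startIndex = delimiterIndex + 1;
            }
        }
    }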

Then the following

            var input = "blah@blah.com;\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;asdasd@asd.co.uk;";
            foreach (var email in SplitEmailsByDelimiter(input, ';'))
            {
                Console.WriteLine(email);
            }

Would give this output

blah@blah.com
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
asdasd@asd.co.uk
Jamie Rees
Darren Gourley

You can also do this without using regular expressions. The following extension method will allow you to specify a delimiter character and a character to begin and end escape sequences. Note it does not validate that all escape sequences are closed.

public static IEnumerable<string> SpecialSplit(
    this string str, char delimiter, char beginEndEscape)
{
    int beginIndex = 0;
    int length = 0;
    bool escaped = false;
    foreach (char c in str)
    {
        if (c == beginEndEscape)
        {
            escaped = !escaped;
        }
            
        if (!escaped && c == delimiter)
        {
            yield return str.Substring(beginIndex, length);
            beginIndex += length + 1;
            length = 0;
            continue;
        }

        length++;
    }

    yield return str.Substring(beginIndex, length);
}

Then the following

var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;\"D;D@blah;blah.com\"";
foreach (var address in input.SpecialSplit(';', '"')) 
    Console.WriteLine(address);

Will give this output

"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
"D;D@blah;blah.com"

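Since the method above does not check that every escape sequence is closed, one possible variation is to throw when the input ends while still inside one. The sketch below is only an illustration of that idea, not part of the original method: the names `StrictSplitExtensions` and `SpecialSplitStrict` and the choice of `FormatException` are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class StrictSplitExtensions
{
    // Hypothetical variant of SpecialSplit that rejects an unclosed escape sequence.
    public static IEnumerable<string> SpecialSplitStrict(
        this string str, char delimiter, char beginEndEscape)
    {
        int beginIndex = 0;
        bool escaped = false;
        for (int i = 0; i < str.Length; i++)
        {
            char c = str[i];
            if (c == beginEndEscape)
            {
                escaped = !escaped;
            }
            else if (!escaped && c == delimiter)
            {
                yield return str.Substring(beginIndex, i - beginIndex);
                beginIndex = i + 1;
            }
        }

        if (escaped)
        {
            // The input ended inside an open escape sequence.
            throw new FormatException("Unclosed escape sequence in input.");
        }

        yield return str.Substring(beginIndex);
    }
}
```

Because this is an iterator, the exception is raised only when enumeration reaches the end of the input, so any earlier items have already been returned by then.
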
Here's a version that also works with an additional single escape character. Two consecutive escape characters become one literal escape character. The escape character can escape the beginEndEscape character, so that it does not trigger the beginning or end of an escape sequence, and it can also escape the delimiter. Anything else that comes after the escape character is left as is, with the escape character removed.

public static IEnumerable<string> SpecialSplit(
    this string str, char delimiter, char beginEndEscape, char singleEscape)
{
    StringBuilder builder = new StringBuilder();
    bool escapedSequence = false;
    bool previousEscapeChar = false;
    foreach (char c in str)
    {
        if (c == singleEscape && !previousEscapeChar)
        {
            previousEscapeChar = true;
            continue;
        }

        if (c == beginEndEscape && !previousEscapeChar)
        {
            escapedSequence = !escapedSequence;
        }

        if (!escapedSequence && !previousEscapeChar && c == delimiter)
        {
            yield return builder.ToString();
            builder.Clear();
            continue;
        }

        builder.Append(c);
        previousEscapeChar = false;
    }

    yield return builder.ToString();
}

Finally, you should probably add null checking for the string that is passed in. Note that both versions return a sequence containing one empty string if you pass in an empty string.
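
One caveat when adding the null check: because these methods use `yield return`, a check placed inside them only runs once enumeration starts. A common pattern is to validate in a non-iterator wrapper and delegate to a private iterator. The following is a minimal sketch of that idea. `SafeSplitExtensions`, `SafeSplit` and `SplitIterator` are hypothetical names, and the iterator body is a `StringBuilder` rendition of the first, simpler split logic rather than the code above verbatim.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SafeSplitExtensions
{
    // Not an iterator, so the null check runs as soon as the method is called.
    public static IEnumerable<string> SafeSplit(
        this string str, char delimiter, char beginEndEscape)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        return SplitIterator(str, delimiter, beginEndEscape);
    }

    private static IEnumerable<string> SplitIterator(
        string str, char delimiter, char beginEndEscape)
    {
        var builder = new StringBuilder();
        bool escaped = false;
        foreach (char c in str)
        {
            if (c == beginEndEscape)
            {
                escaped = !escaped;
            }

            if (!escaped && c == delimiter)
            {
                yield return builder.ToString();
                builder.Clear();
                continue;
            }

            builder.Append(c);
        }

        // For an empty input this yields a single empty string.
        yield return builder.ToString();
    }
}
```

If an empty result is preferred for empty input, the caller could filter out empty strings, much like `StringSplitOptions.RemoveEmptyEntries` does for `string.Split`.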

Community
juharr
  • What if inside the `"` there is another `"` e.g. `"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com` – Jamie Rees Nov 11 '15 at 14:17
  • In that case you also need to tell it that there's an escape character for the double quote. Also you have to then think about what can and cannot be escaped. Presumably "\\" will give you a single backslash, but what about "\t". Do you want a tab or just a single t? – juharr Nov 11 '15 at 14:20
  • Also I'd probably abandon using `string.Substring` and instead use a `StringBuilder` to add characters as I loop through the data. – juharr Nov 11 '15 at 14:26