3

I wrote an XML Schema that contains this element declaration:

<xs:element name="Email">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>

Some of the emails in one of my XML documents fail validation, and I get this error:

The 'Email' element is invalid - The value 'Some_Name@hotmail.com' is invalid according to its datatype 'String' - The Pattern constraint failed. LineNumber: 15404 LinePosition: 32

Comparing the emails that passed with the ones that failed, I noticed that every failing address contains an underscore ("_"). I am unsure whether this is the reason.

Edit

So I changed my regex to this:

 <xs:pattern value="[\w_]+([-+.'][\w_]+)*@[\w_]+([-.][\w_]+)*\.[\w_]+([-.][\w_]+)*"/>

It now works, but I don't understand why \w does not match the underscore.

NakedBrunch
chobo2
  • It looks like you've already identified the problem - your regex doesn't mention underscores at all. – Greg Hewgill Jul 29 '10 at 21:49
  • Shouldn't the character class `\w` include underscores? – Donald Miner Jul 29 '10 at 21:50
  • Hmm, that is weird, since I use a program called Expresso to help me write my regexes and it matches things with underscores. Plus, I think I got this one from the .NET email validator. I also think orangeoctopus is right: \w should catch it. – chobo2 Jul 29 '10 at 21:53

5 Answers

6

The W3C Recommendation on datatypes defines \w as:

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)

The underscore is defined in Unicode as 'LOW LINE' (U+005F), category: punctuation, connector [Pc].

So XML Schema's \w follows the Unicode categories and, unlike \w in most other regex flavors, excludes the underscore.
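To make the category argument concrete, here is a minimal Python sketch. The helper name `is_xsd_word_char` is hypothetical, and it only approximates the W3C definition quoted above by checking the first letter of the Unicode general category; it is not an XSD validator.

```python
import unicodedata

def is_xsd_word_char(ch: str) -> bool:
    # Approximation of XSD's \w (an assumption based on the definition above):
    # accept any character whose general category is not
    # Punctuation (P*), Separator (Z*) or Other (C*).
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")

print(unicodedata.category("_"))   # Pc (punctuation, connector)
print(is_xsd_word_char("_"))       # False
print(is_xsd_word_char("a"))       # True
```

Under this definition every character of "SomeName" is a word character, while the "_" in "Some_Name" is not, which matches the failures described in the question.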

But for an e-mail regexp, you should use strict ASCII, like [0-9A-Za-z_-] instead of \w (I bet an email address with non-Latin characters is invalid :) ). Better yet, find a proven regexp, or check the RFC for the proper e-mail format.
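As a rough illustration of the strict-ASCII suggestion, the sketch below keeps the structure of the pattern from the question and swaps each \w for an explicit ASCII class. The names `ATOM`, `ASCII_EMAIL` and `is_plausible_email` are hypothetical, and using `[0-9A-Za-z_]` (with the hyphen kept as a separator, as in the original pattern) is an assumption. It is written in Python only to make it easy to try; `fullmatch` is used because an XSD pattern must match the whole value. It is not a complete e-mail validator.

```python
import re

# Explicit ASCII replacement for \w, including the underscore (assumption).
ATOM = r"[0-9A-Za-z_]+"

# Same shape as the pattern in the question, with \w+ replaced by ATOM.
ASCII_EMAIL = re.compile(
    rf"{ATOM}([-+.']{ATOM})*@{ATOM}([-.]{ATOM})*\.{ATOM}([-.]{ATOM})*"
)

def is_plausible_email(value: str) -> bool:
    # fullmatch mirrors XSD semantics: the pattern must cover the entire value.
    return ASCII_EMAIL.fullmatch(value) is not None

print(is_plausible_email("Some_Name@hotmail.com"))  # True
print(is_plausible_email("no-at-sign.example"))     # False
```

Because the character class is spelled out, it should behave the same way in an xs:pattern and in other regex engines, at the cost of rejecting non-ASCII addresses.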

Adam Katz
mykhal
  • I updated the formatting and link above but did not contradict the content. The last paragraph is incorrect; see [RFC 5336](https://tools.ietf.org/html/rfc5336), which covers [internationalized email addresses](https://en.wikipedia.org/wiki/International_email) (though the actual representation in email headers must be encoded since [RFC 5322](https://tools.ietf.org/html/rfc5322) requires headers all be ASCII). Also note that a _comprehensive_ regex to match all possible addresses is pretty much impossible. – Adam Katz Mar 16 '16 at 17:17
1

Something is weird because \w typically accepts underscores. Try to add _ to the \w that you would be expecting the _ in, by changing them to [\w_].

Donald Miner
0

That could very well be the cause, because your regex won't recognize an email with an underscore.

Check out this topic: Using a regular expression to validate an email address

It's one I have bookmarked for how useful it is.

Community
NinjaCat
0

Yes. You do not match the underscore character. Just try to add it...

\w+([-+.'_]\w+)*...
relet
0

Something is in fact strange; since the \w character class includes underscores, as we can see with Rubular, the email you have should validate. Is it possible there's another problem—a stray space, for instance? However, the other problem with this is that there is no regular expression which correctly accepts all email addresses and nothing else; this Stack Overflow question has a good answer. There may be a better way to deal with validating email addresses than this schema/regex.
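For comparison, here is a small Python sketch of the behaviour seen in general-purpose regex engines. Python's `re` module stands in for Rubular's Ruby engine here, which is an assumption about equivalence on this point only; the constant name `ORIGINAL` is hypothetical.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = r"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"

# In Python's re module, \w includes the underscore.
print(re.fullmatch(r"\w", "_") is not None)                          # True
print(re.fullmatch(ORIGINAL, "Some_Name@hotmail.com") is not None)   # True
```

So a pattern that passes in a tool built on such an engine can still be rejected by an XML Schema validator if that validator defines \w differently.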

Community
Antal Spector-Zabusky
  • Hmm, I don't think there are any stray spaces (none that I can see). I added "_" to the character class and it works (see my edit). – chobo2 Jul 30 '10 at 00:25