22

I need a regex or a function in PHP that checks whether a string is a valid XML element name.

From w3schools:

XML elements must follow these naming rules:

  1. Names can contain letters, numbers, and other characters
  2. Names cannot start with a number or punctuation character
  3. Names cannot start with the letters xml (or XML, or Xml, etc)
  4. Names cannot contain spaces

I can write a basic regex that checks rules 1, 2 and 4, but it won't account for all the allowed punctuation and won't cover rule 3:

\w[\w0-9-]

Friendly Update

Here is the more authoritative source for well-formed XML Element names:

Names and Tokens

NameStartChar   ::=
    ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
    [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | 
    [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | 
    [#x10000-#xEFFFF]

NameChar    ::=
    NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

Name    ::=
    NameStartChar (NameChar)*

Also a separate non-tokenized rule is specified:

Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Anthony
Mike Starov
  • 1
    Did you really get that list from w3schools? Rule #1 is very badly phrased; aside from letters and digits, only a very few punctuation characters are allowed in XML names. – Alan Moore Mar 26 '10 at 01:02
  • 3
    I think the list of constraints is better explained on [this page](http://www.xml.com/pub/a/2001/07/25/namingparts.html) (XML.com). – Honza Javorek Feb 14 '13 at 18:15
  • 3
    you might want to doublecheck the w3schools (known to have lots of factual errors on their site) claims against the actual spec of the W3C (not affiliated with w3schools): http://www.w3.org/TR/REC-xml/#dt-element – Gordon Mar 03 '13 at 18:20

9 Answers

23

If you want to create valid XML, use the DOM extension. That way you don't have to bother with any regex: if you try to pass an invalid name to a DOMElement, it throws a DOMException.

function isValidXmlName($name)
{
    try {
        new DOMElement($name);
        return TRUE;
    } catch(DOMException $e) {
        return FALSE;
    }
}

This will give

var_dump( isValidXmlName('foo') );      // true   valid localName
var_dump( isValidXmlName(':foo') );     // true   valid localName
var_dump( isValidXmlName(':b:c') );     // true   valid localName
var_dump( isValidXmlName('b:c') );      // false  assumes QName

and is likely good enough for what you want to do.

Pedantic note 1

Note the distinction between localName and QName. ext/dom assumes you are using a namespaced element if there is a prefix before the colon, which adds constraints on how the name may be formed. Technically, b:b is a valid name under the XML 1.0 grammar, though, because the colon is an ordinary NameStartChar and therefore also a NameChar. If you want to accept such names, change the function to

function isValidXmlName($name)
{
    try {
        new DOMElement(
            $name,
            null,
            strpos($name, ':') >= 1 ? 'http://example.com' : null
        );
        return TRUE;
    } catch(DOMException $e) {
        return FALSE;
    }
}

Pedantic note 2

Note that element names may start with "xml". W3schools (which is not affiliated with the W3C) apparently got this part wrong (wouldn't be the first time). If you really want to exclude element names starting with xml, add

if(stripos($name, 'xml') === 0) return false;

before the try/catch.
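
Put together, the reserved-prefix check and the DOM check might look like the sketch below. It assumes ext/dom is loaded; the function name isUnreservedXmlName is made up for illustration, and it keeps the QName behaviour of the first version, so a name such as b:c is still rejected.

```php
<?php
// Hypothetical helper: combines the reserved-prefix check with the DOM check.
// Assumes the DOM extension (ext/dom) is available.
function isUnreservedXmlName(string $name): bool
{
    // Reject names starting with "xml" in any letter case.
    if (stripos($name, 'xml') === 0) {
        return false;
    }

    try {
        // DOMElement throws a DOMException for names it cannot accept.
        new DOMElement($name);
        return true;
    } catch (DOMException $e) {
        return false;
    }
}

var_dump(isUnreservedXmlName('foo'));    // bool(true)
var_dump(isUnreservedXmlName('XmlFoo')); // bool(false), reserved prefix
var_dump(isUnreservedXmlName('1foo'));   // bool(false), starts with a digit
```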

Gordon
  • This introduces a lot of overhead for just checking an element name. I do use DOM objects when I am ready to do actual XML processing. – Mike Starov Mar 31 '10 at 17:36
  • 9
    @xsaero00 well, first of all: we usually don't downvote all answers we didn't accept. All of the answers given contain valid approaches to your problem. Second, I have benchmarked my solution (incl. strpos) versus the accepted solution and incidentally my solution is 250% faster. If you don't believe it, do a benchmark yourself. – Gordon Mar 31 '10 at 18:00
  • Actually, w3schools are basically right about not starting with "xml" (although wrong about other details) - those names are valid, but specially reserved by the spec; the only legal use I know of is `xmlns` and the `xmlns:` prefix, defined by the XML Namespaces spec as attribute names. – IMSoP Mar 31 '15 at 18:53
19

This has been missed so far, even though the question is this old: name validation via PHP's PCRE functions, written to follow the XML specification.

The XML specification defines element names quite clearly (Extensible Markup Language (XML) 1.0 (Fifth Edition)):

[4]  NameStartChar  ::=   ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar       ::=   NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]  Name           ::=   NameStartChar (NameChar)*

This notation can be transposed into a UTF-8 compatible regular expression to be used with preg_match, here as a single-quoted PHP string to be copied verbatim:

'~^[:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}][:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}]*$~u'

Or as another variant with named subpatterns in a more readable fashion:

'~
# XML 1.0 Name symbol PHP PCRE regex <http://www.w3.org/TR/REC-xml/#NT-Name>
(?(DEFINE)
    (?<NameStartChar> [:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}])
    (?<NameChar>      (?&NameStartChar) | [.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}])
    (?<Name>          (?&NameStartChar) (?&NameChar)*)
)
^(?&Name)$
~ux'

Note that this pattern contains the colon : which you might want to exclude (two appearances in the first pattern, one in the second) for XML Namespace validation reasons (e.g. a test for NCName).
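
A colon-free variant for the NCName case could be sketched as follows. This is only an illustration: the constant and function names are invented here, the character ranges are the ones from the grammar above with the colon dropped from both classes, and the pattern ends in \z rather than $ because in PCRE a plain $ also matches just before a trailing newline.

```php
<?php
// Hypothetical constants: NameStartChar and the extra NameChar ranges from
// the XML 1.0 grammar, with the colon left out (NCName).
const NCNAME_START = 'A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}'
    . '\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}'
    . '\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}';
const NCNAME_EXTRA = '.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}';

// Hypothetical helper: true if $name matches NCName, i.e. Name without colons.
function isValidNcName(string $name): bool
{
    // \z anchors at the very end of the subject; the u modifier makes PCRE
    // treat pattern and subject as UTF-8.
    $pattern = '~^[' . NCNAME_START . '][' . NCNAME_START . NCNAME_EXTRA . ']*\\z~u';

    return 1 === preg_match($pattern, $name);
}

var_dump(isValidNcName('foo'));                  // bool(true)
var_dump(isValidNcName('b:c'));                  // bool(false), contains a colon
var_dump(isValidNcName("\u{E4}\u{F8}\u{F1}"));   // bool(true), the name äøñ
var_dump(isValidNcName("foo\n"));                // bool(false), trailing newline
```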

Usage Example:

$name    = '::...';
$pattern = '~
# XML 1.0 Name symbol PHP PCRE regex <http://www.w3.org/TR/REC-xml/#NT-Name>
(?(DEFINE)
    (?<NameStartChar> [:A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\x{2FF}\\x{370}-\\x{37D}\\x{37F}-\\x{1FFF}\\x{200C}-\\x{200D}\\x{2070}-\\x{218F}\\x{2C00}-\\x{2FEF}\\x{3001}-\\x{D7FF}\\x{F900}-\\x{FDCF}\\x{FDF0}-\\x{FFFD}\\x{10000}-\\x{EFFFF}])
    (?<NameChar>      (?&NameStartChar) | [.\\-0-9\\xB7\\x{0300}-\\x{036F}\\x{203F}-\\x{2040}])
    (?<Name>          (?&NameStartChar) (?&NameChar)*)
)
^(?&Name)$
~ux';

$valid = 1 === preg_match($pattern, $name); # bool(true)

The claim that an element name cannot start with XML (in lower or uppercase letters) is not correct. <XML/> is perfectly well-formed XML, and XML is a perfectly well-formed element name.

It is just that such names are in the subset of well-formed element names that are reserved for standardization (XML version 1.0 and above). It is easy to test if a (well-formed) element name is reserved with a string comparison:

$reserved = $valid && 0 === stripos($name, 'xml');

or alternatively another regular expression:

$reserved = $valid && 1 === preg_match('~^[Xx][Mm][Ll]~', $name);

PHP's DOMDocument cannot test for reserved names; at least I don't know of any way to do that, and I've looked a lot.

A valid element name needs a Unique Element Type Declaration, which seems to be outside the scope of this question, as no such declaration has been provided. Therefore this answer does not cover it. If there were an element type declaration, you would only need to validate against a whitelist of all declared names, which is a simple case-sensitive string comparison.
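
As a rough illustration of that whitelist comparison (the list of declared names and the function name are invented for this example, not taken from any real DTD):

```php
<?php
// Hypothetical list of element type names, e.g. collected from a DTD.
$declaredElements = ['book', 'title', 'author'];

// Hypothetical helper: true if $name is one of the declared element names.
function isDeclaredElementName(string $name, array $declared): bool
{
    // Strict mode makes in_array() compare with ===, so the match is
    // case-sensitive and free of type juggling.
    return in_array($name, $declared, true);
}

var_dump(isDeclaredElementName('title', $declaredElements)); // bool(true)
var_dump(isDeclaredElementName('Title', $declaredElements)); // bool(false)
```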


Excursion: What does DOMDocument do differently from the regular expression?

Compared with the regular expression, DOMDocument / DOMElement differ in what qualifies as a valid element name. The DOM extension works in a kind of mixed mode, which makes it less predictable what it validates. The following excursion illustrates the behavior and shows how to control it.

Let's take $name and instantiate an element:

$element = new DOMElement($name);

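The outcome depends on the first character. Judging by the examples in this thread, a name that starts with a colon is only checked against the XML 1.0 Name symbol, while any other name is checked as a namespace QName.

So the first character decides the comparison mode.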

A regular expression states explicitly what it checks for, here the XML 1.0 Name symbol.

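You can get close to the same check with DOMElement by prefixing the name with a colon. Note that the added colon then takes the NameStartChar position, so a name whose own first character is only a NameChar (a digit, "-" or ".") also passes: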

function isValidXmlName($name)
{

    try {
        new DOMElement(":$name");
        return TRUE;
    } catch (DOMException $e) {
        return FALSE;
    }
}

To check explicitly for a QName, turn the name into a PrefixedName in case it is an UnprefixedName:

function isValidXmlnsQname($qname)
{
    $prefixedName = (!strpos($qname, ':') ? 'prefix:' : '') . $qname;

    try {
        new DOMElement($prefixedName, NULL, 'uri:ns');
        return TRUE;
    } catch (DOMException $e) {
        return FALSE;
    }
}
hakre
8

How about

/\A(?!XML)[a-z][\w0-9-]*/i

Usage:

if (preg_match('/\A(?!XML)[a-z][\w0-9-]*/i', $subject)) {
    # valid name
} else {
    # invalid name
}

Explanation:

\A  Beginning of the string
(?!XML)  Negative lookahead (assert that it is impossible to match "XML")
[a-z]  Match a non-digit, non-punctuation character
[\w0-9-]*  Match an arbitrary number of allowed characters
/i  make the whole thing case-insensitive
Leo
  • 13
    This doesn’t match <äøñ> which is a valid Nmtoken as of XML 1.1. See http://www.w3.org/TR/xml11/#sec-common-syn – fuxia Mar 25 '10 at 22:55
  • This expression with some mods for unicode plus filter_var() should do the job. Thanks. – Mike Starov Mar 31 '10 at 17:37
  • 1
    I added my [answer with an Unicode compatible PCRE regex](http://stackoverflow.com/a/15188815/367456). – hakre Mar 03 '13 at 18:02
  • 2
    This also doesn't mention '.' (period/full stop), which is also valid in XML element names. – James M. Greene May 03 '13 at 17:44
  • for unicode in regexes there is `\p{L}` for letters and `\p{N}` for numbers. they should match everything the unicode spec considers letters or numbers. That might not be the same thing as xml 1.1 considers letters/numbers, I don't know enough about the spec – Tim Seguine Dec 20 '13 at 09:57
  • Given the issues others have pointed out, I strongly recommend [Gordon's answer](http://stackoverflow.com/a/2519943/617159) instead. – Lambda Fairy Aug 20 '14 at 08:25
1

If you are using the .NET Framework, try XmlConvert.VerifyName. It returns the name if it is valid and throws an XmlException otherwise. Alternatively, use XmlConvert.EncodeName to convert an invalid name into a valid one.

Keith Vinson
0

The expression below should match valid unicode element names excepting xml. Names that start or end with xml will still be allowed. This passes @toscho's äøñ test. The one thing I could not figure out a regex for was extenders. The xml element name spec says:

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

[5] Name ::= (Letter | '_' | ':') (NameChar)*

But there's no clear definition for a unicode category or class containing extenders.

^[\p{L}_:][\p{N}\p{L}\p{Mc}.\-|:]*((?<!xml)|xml)$
JamieSee
0

XML, xml, etc. are valid tag names; they are just "reserved for standardization in this or future versions of this specification", which will likely never happen. Please check the real standard at https://www.w3.org/TR/REC-xml/. The w3schools article is inaccurate.

0

Use this regex:

^_?(?!(xml|[_\d\W]))([\w.-]+)$

This covers all four of your points and allows Unicode characters.

Darren
-1

This should give you roughly what you need [Assuming you are using Unicode]:
(Note: This is completely untested.)

[^\p{P}xX0-9][^mMlL\s]{2}[\w\p{P}0-9-]

\p{P} is the syntax for Unicode Punctuation marks in PHP's regular expression syntax.

Sean Vieira
  • Among other problems, that won't match anything that starts with 'x' or has 'm' or 'l' as the second or third characters. That disallows a lot more than just "xml". – Alan Moore Mar 26 '10 at 00:57
  • @Alan; very valid point. Could you use negative look-aheads instead? (More for curiosity than anything else. Gordon's way is far better than what I posted off-hand.) – Sean Vieira Mar 26 '10 at 01:14
  • 1
    That's right. @Mef's answer has its own problems, but it demonstrates how to use a lookahead for that part of the job. – Alan Moore Mar 26 '10 at 03:12
-3
if ((substr(strtolower($text), 0, 3) != 'xml') && (1 === preg_match('/^\w[^<>]+$/', $text)))
{
    // valid;
}
Jacob Relkin
Amy B