I am trying to find all the invalid characters in an XML file. According to the W3C recommendation, these are the valid characters in XML:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Converting it to decimal:

9
10
13
32-55295
57344-65533
65536-1114111

are the valid XML characters.

I am trying to search in Notepad++ with a regular expression that matches the invalid characters.

A snippet from my XML:

        <custom-attribute attribute-id="isContendFeed">fal &#11; se</custom-attribute>
        <custom-attribute attribute-id="pageNoFollow">fal &#3; se</custom-attribute>
        <custom-attribute attribute-id="pageNoIndex">fal &#13; se</custom-attribute>
        <custom-attribute attribute-id="rrRecommendable">false</custom-attribute>

From the example above, I want my regular expression to find &#11; and &#3;, because these are not allowed in XML.

I have not been able to construct a regular expression for this.

The regular expressions I made for the numeric ranges:

32-55295 : (3[2-9]|[4-9][0-9]|[1-9][0-9]{2,3}|[1-4][0-9]{4}|5[0-4][0-9]{3}|55[01][0-9]{2}|552[0-8][0-9]|5529[0-5])
57344-65533 : (5734[4-9]|573[5-9][0-9]|57[4-9][0-9]{2}|5[89][0-9]{3}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-3])
65536-1114111 : (6(5(5(3[6-9]|[4-9][0-9])|[6-9][0-9]{2})|[6-9][0-9]{3})|[7-9][0-9]{4}|[1-9][0-9]{5}|1(0[0-9]{5}|1(0[0-9]{4}|1([0-3][0-9]{3}|4(0[0-9]{2}|1(0[0-9]|1[01])))))))

These regular expressions work when used separately, but I am not able to combine them into one complete regex.

Is there any way other than a regular expression to find the invalid characters? If not, please help me construct a regular expression that finds the invalid characters in my XML.

Vikas Mangal
  • you could just launch a validating tool like `xmllint` on it – guido May 14 '15 at 05:18
  • I found a Notepad++ plugin, [XMLTools](http://sourceforge.net/projects/npp-plugins/files/XML%20Tools/Xml%20Tools%202.4.6%20Unicode/), which served the purpose. The only problem is that it reports the invalid characters one by one, not all in one go. – Vikas Mangal May 14 '15 at 05:55
  • do you mean that invalid characters are numbers `1-8` `11,12` `14-31` `55296-57343` `65534,65535` `and any number greater than 1114111` – Nader Hisham May 14 '15 at 06:00
  • @NaderHisham No. Invalid characters are characters having decimal code among the numbers you mentioned. Look at the XML in the question. [See this](http://www.w3.org/TR/xml/#charsets). I have only converted these from hex to decimal. – Vikas Mangal May 14 '15 at 06:18

1 Answer

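First, distinguish the literal text &#3; (a character reference) from the raw character with code 3. The list in the question describes characters, and a regex character class matches raw characters, not the text &#3;. Note that XML 1.0 also requires a character reference to resolve to an allowed character, so a conforming parser still rejects &#3; and &#11;; only XML 1.1 permits most control characters when written as references.

Second, most regular expression flavors let you specify a character by its code, as \x00 (two hex digits) or \u0000 (four hex digits). Some flavors also accept something like \x{...}, but this differs from flavor to flavor.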
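
Because the snippet in the question contains numeric character references rather than raw control characters, one way to list them all in one go is to pull the number out of each reference and test it against the allowed ranges. The following is only a minimal Python sketch of that idea, not a replacement for a real validator; the helper names is_valid_xml_char and find_invalid_char_refs are made up for this illustration.

```python
import re

def is_valid_xml_char(cp):
    """True if the code point is in the allowed set quoted in the question."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Matches decimal (&#11;) and hexadecimal (&#xB;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def find_invalid_char_refs(text):
    """Return (offset, reference) pairs whose code point is not allowed."""
    hits = []
    for m in CHAR_REF.finditer(text):
        cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if not is_valid_xml_char(cp):
            hits.append((m.start(), m.group(0)))
    return hits

sample = (
    '<custom-attribute attribute-id="isContendFeed">fal &#11; se</custom-attribute>\n'
    '<custom-attribute attribute-id="pageNoFollow">fal &#3; se</custom-attribute>\n'
    '<custom-attribute attribute-id="pageNoIndex">fal &#13; se</custom-attribute>\n'
)
print([ref for _, ref in find_invalid_char_refs(sample)])  # ['&#11;', '&#3;']
```

Comparing the parsed number against the ranges avoids having to encode the decimal ranges as digit patterns.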

We start with

[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]

[^...] defines a negated set of characters and character ranges (and more). Simply fill it with all the allowed characters and ranges, and it matches everything else.

If your flavor understands \x{...}, it's easy to extend to the range above #xFFFF.

[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\x{10000}-\x{10FFFF}]
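
As one concrete instance of "it differs from flavor to flavor": Python's re module does not accept \x{...}, but it does accept \uXXXX and \UXXXXXXXX, so the same negated set can be written as in the sketch below. This is an illustration for raw characters only (it does not look at character references), and find_invalid_chars is a hypothetical helper name.

```python
import re

# Negated set of every allowed XML character; anything it matches is invalid.
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_chars(text):
    """Return (offset, 'U+XXXX') pairs for every raw invalid character."""
    return [
        (m.start(), "U+%04X" % ord(m.group()))
        for m in INVALID_XML_CHAR.finditer(text)
    ]

# Raw control characters 0x0B and 0x03 are invalid; carriage return is allowed.
sample = "fal \x0b se fal \x03 se fal \r se"
print([cp for _, cp in find_invalid_chars(sample)])  # ['U+000B', 'U+0003']
```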

Otherwise you have to match the surrogate pairs one code unit at a time...

\x{10000} is the same as \uD800\uDC00

\x{10FFFF} is the same as \uDBFF\uDFFF
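
These two equivalences follow from the standard UTF-16 encoding arithmetic: subtract 0x10000, then split the remaining 20 bits into a high and a low half. A short Python sketch to check them (to_surrogate_pair is a name chosen for this example):

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
```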

That cannot be done in a single set. No fun ;) It's something like the negated version of

[\uD800-\uDBFF][\uDC00-\uDFFF]|
[\uD800-\uDBFF](?![\uDC00-\uDFFF])|
(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]

(from https://mathiasbynens.be/notes/javascript-unicode#matching-code-points)

Wolfgang Kluge