
Yesterday I asked a question about detecting invalid XML characters in Java, and this expression works as expected:

String xml10pattern = "[^"
                + "\u0009\r\n" // #x9 | #xA | #xD 
                + "\u0020-\uD7FF" // [#x20-#xD7FF]
                + "\uE000-\uFFFD" // [#xE000-#xFFFD] 
                + "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
                + "]";

However, I realized it would be better to check for invalid characters on the client side using JavaScript, but I didn't succeed.

I almost got it working, except for the range U+10000–U+10FFFF: http://jsfiddle.net/mymxyjaf/15/

For the last range, I tried

 var rg = /[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]/g; 

but it doesn't work. A regex tester reports "Range values reversed". I think this is because \ud800\udc00-\udbff\udfff is interpreted as three expressions:

\ud800; \udc00-\udbff; \udfff  

and, of course, the middle one fails, because \udc00 is greater than \udbff.
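
The reversed range can be reproduced directly. The sketch below is only an illustration: `compiles`, `astralRange` and `invalidXml` are hypothetical names, and the second half assumes an ES2015+ engine that supports the `u` (Unicode) flag, which is not what the fiddles in this question use. With `u`, an escaped surrogate pair inside a character class is read as a single code point, so the Java-style range is accepted.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(source, flags) {
    try {
        new RegExp(source, flags);
        return true;
    } catch (e) {
        return false; // a SyntaxError such as "Range out of order in character class"
    }
}

var astralRange = '\\ud800\\udc00-\\udbff\\udfff';

// Without "u" the class is read as \ud800, \udc00-\udbff, \udfff,
// and the middle range is reversed (0xDC00 > 0xDBFF).
console.log(compiles('[' + astralRange + ']', ''));  // false

// With "u" (assumption: ES2015+ engine) each escaped pair is one code point,
// so the range covers U+10000 to U+10FFFF.
console.log(compiles('[' + astralRange + ']', 'u')); // true

// Negated class for characters that are NOT valid XML 1.0, in Unicode mode.
var invalidXml = new RegExp(
    '[^\\u0009\\r\\n\\u0020-\\uD7FF\\uE000-\\uFFFD' + astralRange + ']', 'u');

console.log(invalidXml.test('eo\uD800\uDC01')); // false: U+10001 is valid
console.log(invalidXml.test('eo\uD801'));       // true: lone surrogate is invalid
console.log(invalidXml.test('eo\u0001'));       // true: U+0001 is invalid
```

On engines without the `u` flag this single-regex form is not available, which is why the rest of this thread falls back to two separate expressions.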

So, my question is how to convert the above Java regular expression to JavaScript.

Thanks.

==== UPDATE ====

Thanks to @collapsar's comments, I tried using two regular expressions.
While doing so, I realized I can't use a negated character class [^...]:
it would discard valid characters such as U+10001. In other words, this is not right:

function validateIllegalChars(str) {
    var re1 = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g; 
    var re2 = /[^[\uD800-\uDBFF][\uDC00-\uDFFF]]/g;
    var str2 = str.replace(re1, '').replace(re2, ''); // First replace would remove all valid characters [#x10000-#x10FFFF]
    alert('str2:' + str2);
    if (str2 != str) return false;
    return true;
}
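
A small sketch of why this version fails, assuming the input stores U+10001 as the surrogate pair \uD800\uDC01 (the variable names here are illustrative only). The negated class in re1 sees each surrogate as a separate, disallowed code unit, and re2 is not a nested class at all: without the `u` or `v` flag it appears to be parsed as two character classes followed by a literal `]`.

```javascript
var negatedBmp = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g;

// "eo" followed by U+10001, stored as two UTF-16 code units.
var validInput = 'eo\uD800\uDC01';
console.log(validInput.length);                  // 4

// Both halves of the pair fall outside the allowed ranges, so the valid
// character is stripped and the result differs from the input.
var stripped = validInput.replace(negatedBmp, '');
console.log(stripped);                           // "eo"

// The "nested" pattern actually means: one character that is neither "[" nor
// a high surrogate, then a low surrogate, then a literal "]".
var nested = /[^[\uD800-\uDBFF][\uDC00-\uDFFF]]/;
console.log(nested.test('a\uDC00]'));            // true
console.log(nested.test('\uD800\uDC01'));        // false
```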

Then I tried the following (http://jsfiddle.net/mymxyjaf/18/):

function valPos(str) { 
    var re1 = /[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g; 
    var re2 = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

    var str2 = str.replace(re1, '').replace(re2, ''); 
    if (str2.length === 0) return true; 
    alert('str2:' + str2 + '; length: ' + str2.length);
    return false; 
}

However, when I call this function as valPos('eo' + String.fromCharCode(65537)), where 65537 is U+10001, it returns false. What is wrong, and how can I solve it?

Community
Albert
  • The `\u` notation (so far) only supports 16-bit code units. [This SO answer](http://stackoverflow.com/a/16346705) will solve your problem (specify the code points in question as surrogate pairs). However, you _should_ be able to use the original solution if you create a RegExp object from a string: `new RegExp(xml10pattern);` with `xml10pattern` defined as in your question. – collapsar Mar 13 '15 at 12:11
  • @collapsar, I think it does not work. For instance, `U+D801` shouldn't be accepted (it's not valid XML), yet it seems to be accepted: http://jsfiddle.net/mymxyjaf/16/. What is wrong? – Albert Mar 13 '15 at 12:42
  • In your fiddle, you have nested character classes in your first regex. This is a syntax error. Follow the recipe in the cited answer - you cannot build a single negated character class (or a single regex) because the limits of the offending code points will be represented by _2_ characters. – collapsar Mar 13 '15 at 12:53
  • @collapsar, the expression I just used is `var re = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD[\uD800-\uDBFF][\uDC00-\uDFFF]]/g;`. It looks like it doesn't treat `U+D801` as part of a surrogate pair. It seems it only checks the first part, `[\uD800-\uDBFF]`. – Albert Mar 13 '15 at 12:53
  • @collapsar, so you mean I must use two regular expressions? One for 16-bit code points, and the other for `U+10000 - U+10FFFF`? – Albert Mar 13 '15 at 12:56
  • At least that's the way I understood the cited answer. – collapsar Mar 13 '15 at 13:09
  • @collapsar Thanks, but it's still not working. Negation (`^`) won't work because it removes valid chars (like `U+10001`). So I tried without negation: http://jsfiddle.net/mymxyjaf/18/ (function `valPos()`), but it doesn't work either. – Albert Mar 13 '15 at 13:37
  • Reverse your substitutions: `var str2 = str.replace(re2, '').replace(re1, '');` (instead of `var str2 = str.replace(re1, '').replace(re2, '');`) – collapsar Mar 13 '15 at 14:52
  • I already did, and nothing changed. I think something is wrong with the regexp or with the function `String.fromCharCode(65537)`, because I even tried a simple sample like: `var strA = 'eo' + String.fromCharCode(65537); var strB = strA.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '');` and it doesn't replace that character. I can't understand it and I don't know what else to try. Thanks for the effort. – Albert Mar 13 '15 at 15:03
  • Have you tested different browsers? Perhaps it's just a bug in the interpreter of a specific browser / version? – Marvin Emil Brach Mar 13 '15 at 23:58
  • @MarvinEmilBrach, I tested on different versions of Firefox on different operating systems, but not in other browsers. Not even this simple example works: http://jsfiddle.net/xpg9kvzp/. It should remove the _weird_ character, but it doesn't. – Albert Mar 14 '15 at 20:02

1 Answer


I finally solved it.

The answer to my own question, as @collapsar told me, could be:

function validateIllegalChars(str) { 

    var re1 = /[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g;  // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] 
    var re2 = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g; // [#x10000-#x10FFFF]

    var res = str.replace(re1, '').replace(re2, ''); // Should remove any valid character

    if (!!res && res.length > 0) {  // any remaining characters, means input str is not valid 
        return false; 
    }

    return true; 
} 

The previous examples (the ones I posted on jsfiddle) didn't work for me because String.fromCharCode(65537) does not generate the character with code point U+10001, as I thought, but U+0001: the function only keeps the low 16 bits of its argument.
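
A short sketch of that behaviour, and of one way to build the intended test character. `toSurrogatePair` is a hypothetical helper written for this illustration; on ES2015+ engines `String.fromCodePoint(0x10001)` should give the same string.

```javascript
// String.fromCharCode truncates each argument to 16 bits: 0x10001 becomes 0x0001.
var truncated = String.fromCharCode(65537);
console.log(truncated === '\u0001');        // true

// Hypothetical helper: encode a supplementary code point (U+10000 to U+10FFFF)
// as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;
    var high = 0xD800 + (offset >> 10);     // top 10 bits
    var low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
    return String.fromCharCode(high, low);
}

var u10001 = toSurrogatePair(0x10001);
console.log(u10001 === '\uD800\uDC01');     // true
console.log(u10001.length);                 // 2

// The surrogate-pair regex from the answer now removes the character.
var rest = ('eo' + u10001).replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '');
console.log(rest);                          // "eo"
```

With the test string built this way, validateIllegalChars('eo' + u10001) should return true, while the truncated U+0001 is correctly rejected.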

Thanks for the help.

JLRishe
Albert