Regular expression for excluding some characters with multiline matching

Question

I want to ensure that user input doesn't contain characters like <, > or &#, whether it comes from a text input or a textarea. My pattern:

var pattern = /^((?!&#|<|>).)*$/m;

The problem is that it still matches multiline strings from a textarea, like

this text matches

though this should not, because of this character <

EDIT:

To be clearer, I need to exclude the &# combination only, not & or # on their own.

Please suggest a solution. I'd be very grateful.

It appears you are looking to exclude all special HTML chars including entities such as: `&#160;`, `&#xA0;`, etc. (ergo the exclusion `&#`). If so, then you probably want to also exclude the other syntax of HTML entities: e.g. `&amp;`, `&lt;`, etc. (which do NOT have the `#` hash following the `&`). Yes? – ridgerunner Jul 29 '13 at 16:22

score 2 · Answer 1 · answered Jul 29 '13 at 15:36

I don't think you need a lookaround assertion in this case. Simply use a negated character class:

var pattern = /^[^<>&#]*$/m;

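If you're also disallowing the characters -, [ and ], make sure to escape them or put them where they are read literally. In JavaScript the ] must be escaped inside a character class (an unescaped [^] is a complete class that matches any character), while - is literal when it comes last:

var pattern = /^[^\][<>&#-]*$/m;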
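A quick check of how the placement of ] behaves in JavaScript specifically. The variable names and sample strings are made up for illustration, and the m flag is left off so that only the character class is under test:

```javascript
// In JavaScript, [^] is a complete class that matches any single character,
// so an unescaped ] placed first does not become part of the negated set.
var unescaped = /^[^][<>&#-]*$/;  // parsed as [^] followed by [<>&#-]*
var escaped = /^[^\][<>&#-]*$/;   // one negated class: none of ] [ < > & # -

var samples = ['plain text', 'a]b', 'a-b', 'x<'];
samples.forEach(function (s) {
    // Print each sample with the verdict of both patterns.
    console.log(JSON.stringify(s), unescaped.test(s), escaped.test(s));
});
```

With these samples the unescaped pattern rejects 'plain text' and accepts 'x<', while the escaped one accepts only 'plain text'.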

Andrew Cheong

Thanks for the reply. The problem still exists, though. First, I need to exclude the `&#` combination but allow `&` or `#` on their own. Second, `/^[^<>]*$/m.test(multiLineText)` returns true even if the second line of `multiLineText` contains a special character. – Zabavsky Jul 29 '13 at 15:50

Ah, I missed that `&#` was meant as a single sequence, sorry. @anubhava's got it, though I'm not sure what his extra dot-all class (that you're omitting) is about. – Andrew Cheong Jul 29 '13 at 16:23
Doesn't matter, the main part is that the dot `.` should be replaced with `[\s\S]` for multiline matching. Thanks for your help anyway. – Zabavsky Jul 29 '13 at 16:59

anubhava · Accepted Answer · 2013-07-29T16:22:50.553

2

You're probably looking not for the m (multiline) switch but for the s (DOTALL) switch. Unfortunately, JavaScript had no s flag when this was written (one was only added later, in ES2018).

The good news is that DOTALL can be simulated using [\s\S]. Try either of the following regexes:

/^(?![\s\S]*?(&#|<|>))[\s\S]*$/

OR:

/^((?!&#|<|>)[\s\S])*$/
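To see the difference, here is a small sketch comparing the question's original pattern with the two regexes above. The variable names and the `clean` and `entity` sample strings are illustrative assumptions, not from the original posts:

```javascript
// Original pattern from the question: with the m flag, ^ and $ match at line
// boundaries, so one clean line is enough for the whole test to pass.
var original = /^((?!&#|<|>).)*$/m;
// The two regexes from this answer: [\s\S] also consumes newlines, and
// without the m flag ^ and $ anchor to the whole string.
var upFront = /^(?![\s\S]*?(&#|<|>))[\s\S]*$/;
var perChar = /^((?!&#|<|>)[\s\S])*$/;

var clean = 'first line\nsecond line with & and # apart';
var dirty = 'this text matches\n\nthough this should not, because of this character <';
var entity = 'an entity prefix &#38; on one line';

console.log(original.test(dirty));                        // true (the reported bug)
console.log(upFront.test(dirty), perChar.test(dirty));    // false false
console.log(upFront.test(clean), perChar.test(clean));    // true true
console.log(upFront.test(entity), perChar.test(entity));  // false false
```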

edited Jul 29 '13 at 16:22

answered Jul 29 '13 at 16:06

Ok let me see if I can reproduce the problem. – anubhava Jul 29 '13 at 16:11
OK edited my answer after some testing and added some explanation as well. – anubhava Jul 29 '13 at 16:14
You are right. I found the answer [here](http://stackoverflow.com/a/16119722/1199711) a little earlier. Though I use the `/^((?!&#|<|>)[\s\S])*$/` pattern, I will accept your answer as the correct one. Thanks for your help. – Zabavsky Jul 29 '13 at 16:18

score 2 · Answer 3 · answered Jul 29 '13 at 19:08

Alternate answer to specific question:

anubhava's solution works accurately, but it is slow because it must perform a negative lookahead at every character position in the string. A simpler approach is to reverse the logic: instead of verifying that /^((?!&#|<|>)[\s\S])*$/ does match, verify that /[<>]|&#/ does NOT match. To illustrate, let's create a function, hasSpecial(), which tests whether a string contains one of the special chars. Here are two versions; the first uses anubhava's second regex:

function hasSpecial_1(text) {
    // If regex matches, then string does NOT contain special chars.
    return !/^((?!&#|<|>)[\s\S])*$/.test(text);
}
function hasSpecial_2(text) {
    // If regex matches, then string contains (at least) one special char.
    return /[<>]|&#/.test(text);
}

These two functions are functionally equivalent, but the second one is probably quite a bit faster.
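A small sanity check that the two approaches agree. The variable names and sample inputs are invented for illustration, and nothing is timed here, so the speed claim is not measured by this sketch:

```javascript
// The two regexes used by hasSpecial_1 and hasSpecial_2 respectively.
var matchesWhenClean = /^((?!&#|<|>)[\s\S])*$/;  // matches clean text
var matchesWhenSpecial = /[<>]|&#/;              // matches text with a special

var samples = [
    '', 'plain', 'a & b', 'a # b', 'line one\nline two',
    'x &# y', 'tag <b>', 'line one\nline > two'
];

// Collect every sample on which the two verdicts differ.
var disagreements = samples.filter(function (s) {
    return !matchesWhenClean.test(s) !== matchesWhenSpecial.test(s);
});
console.log(disagreements.length); // 0
```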

Note that when I originally read this question, I misinterpreted it as wanting to exclude HTML special chars (including HTML entities). If that were the case, the following solution would do just that.

Test if a string contains HTML special Chars:

It appears that the OP wants to ensure a string does not contain any special HTML characters, including <, >, as well as decimal and hex HTML entities such as &#160;, &#xA0;, etc. If this is the case, then the solution should probably also exclude the other (named) type of HTML entities such as &amp;, &lt;, etc. The solution below excludes all three forms of HTML entities as well as the <> tag delimiters.

Here are two approaches: (Note that both approaches do allow the sequence: &# if it is not part of a valid HTML entity.)

FALSE test using positive regex:

function hasHtmlSpecial_1(text) {
    /* Commented regex:
        # Match string having no special HTML chars.
        ^                  # Anchor to start of string.
        [^<>&]*            # Zero or more non-[<>&] (normal*).
        (?:                # Unroll the loop. ((special normal*)*)
          &                # Allow a & but only if
          (?!              # not an HTML entity (3 valid types).
            (?:            # One from 3 types of HTML entities.
              [a-z\d]+     # either a named entity,
            | \#\d+        # or a decimal entity,
            | \#x[a-f\d]+  # or a hex entity.
            )              # End group of HTML entity types.
            ;              # All entities end with ";".
          )                # End negative lookahead.
          [^<>&]*          # More (normal*).
        )*                 # End unroll the loop.
        $                  # Anchor to end of string.
    */
    var re = /^[^<>&]*(?:&(?!(?:[a-z\d]+|#\d+|#x[a-f\d]+);)[^<>&]*)*$/i;
    // If regex matches, then string does NOT contain HTML special chars.
    return !re.test(text);
}

Note that the above regex utilizes Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique and will run very quickly for both matching and non-matching cases. (See his regex masterpiece: Mastering Regular Expressions (3rd Edition))
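To make the unrolling concrete, the sketch below compares the unrolled regex with a plain alternation form of the same grammar. The alternation version, the helper variable `entity` and the sample strings are my own illustrative assumptions, not part of the original answer:

```javascript
// Entity body shared by both patterns: named, decimal or hex, then ";".
var entity = '(?:[a-z\\d]+|#\\d+|#x[a-f\\d]+);';

// Plain alternation: at each position take one normal char or an allowed &.
var alternation = new RegExp('^(?:[^<>&]|&(?!' + entity + '))*$', 'i');
// Unrolled form, same source as the regex above: normal* (special normal*)*
var unrolled = new RegExp('^[^<>&]*(?:&(?!' + entity + ')[^<>&]*)*$', 'i');

var samples = [
    'fish & chips', 'AT&T', 'a &# b', 'two\nlines',
    'x &amp; y', '&#160;', '&#xA0;', 'a < b'
];
samples.forEach(function (s) {
    // Both columns should show the same verdict for every sample.
    console.log(JSON.stringify(s), alternation.test(s), unrolled.test(s));
});
```

Both patterns accept the first four samples and reject the last four; the unrolled form gets there with runs of `[^<>&]*` instead of one alternation step per character.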

TRUE test using negative regex:

function hasHtmlSpecial_2(text) {
    /* Commented regex:
        # Match string having one special HTML char.
          [<>]           # Either a tag delimiter
        | &              # or a & if start of
          (?:            # one of 3 types of HTML entities.
            [a-z\d]+     # either a named entity,
          | \#\d+        # or a decimal entity,
          | \#x[a-f\d]+  # or a hex entity.
          )              # End group of HTML entity types.
          ;              # All entities end with ";".
    */
    var re = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;
    // If regex matches, then string contains (at least) one special HTML char.
    return re.test(text);
}

Note also that I have included a commented version of each of these (non-trivial) regexes in the form of a JavaScript comment.
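For validation messages it can help to know which token triggered the match. A minimal sketch using the same regex as hasHtmlSpecial_2; the helper name `findHtmlSpecial` and its return shape are hypothetical, chosen only for this example:

```javascript
// Same regex as in hasHtmlSpecial_2 (no g flag, so exec is stateless).
var htmlSpecial = /[<>]|&(?:[a-z\d]+|#\d+|#x[a-f\d]+);/i;

// Hypothetical helper: reports the first offending token and its offset.
function findHtmlSpecial(text) {
    var m = htmlSpecial.exec(text);
    // exec returns null when nothing matched; otherwise m[0] is the matched
    // text and m.index is its position in the input.
    return m ? { token: m[0], index: m.index } : null;
}

console.log(findHtmlSpecial('price &#8364;5')); // { token: '&#8364;', index: 6 }
console.log(findHtmlSpecial('x <b>'));          // { token: '<', index: 2 }
console.log(findHtmlSpecial('a &# b'));         // null
```

The last call returns null because a bare &# is not a complete entity, in line with the note that both approaches allow that sequence.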

I like your idea. Thank you very much. Btw, feel free to edit the question, because it seems a bit confusing to many people. — Zabavsky, Jul 30 '13 at 06:53