43

I am having trouble coming up with a regular expression that would essentially blacklist certain special characters.

I need to use this to validate data in input fields (in a Java web app). We want to allow users to enter any digit, any letter (including accented characters, e.g. French or German) and some special characters such as '-. etc.

How do I blacklist characters such as <>%$ etc?

TylerH
  • 9
    I'll put this in a comment since it isn't a complete solution, but only a suggestion. You are going to be much better off white-listing characters than blacklisting them since there are likely to be far fewer characters you want to allow than deny. – JohnFx Apr 16 '09 at 15:07
  • Check my updated answer for using unicode ranges, perhaps that would simplify the whitelist issue? – Jason Coyne Apr 16 '09 at 15:34
  • In blacklist mode, Japanese, Chinese, Korean etc. will all be allowed. Is this acceptable? – Jason Coyne Apr 16 '09 at 17:25

11 Answers

58

I would just white list the characters.

^[a-zA-Z0-9äöüÄÖÜ]*$

Building a blacklist is equally simple with regex, but you might need to add many more characters: there are a lot of Chinese symbols in Unicode ... ;)

^[^<>%$]*$

The expression [^(many characters here)] just matches any character that is not listed.
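As a rough sketch of how the two patterns above could be used from Java (the class and method names here are made up for illustration, not part of any library):

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
final class InputCheck {
    // Whitelist from the answer: ASCII letters, digits and German umlauts.
    private static final Pattern WHITELIST = Pattern.compile("^[a-zA-Z0-9äöüÄÖÜ]*$");
    // Blacklist from the answer: any character except < > % $.
    private static final Pattern BLACKLIST = Pattern.compile("^[^<>%$]*$");

    // True when every character is on the whitelist.
    static boolean passesWhitelist(String input) {
        return WHITELIST.matcher(input).matches();
    }

    // True when no character is on the blacklist.
    static boolean passesBlacklist(String input) {
        return BLACKLIST.matcher(input).matches();
    }
}
```

Note that the blacklist version accepts accented letters it has never heard of, while the whitelist version rejects them until they are added to the class.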

Daniel Brückner
  • 3
    Your whitelist pattern only includes the German umlauts, but no French or other characters, and there are many common ones, like ñëÿêâôîíì etc. Therefore, basically only a Unicode character group makes whitelisting possible with the requirement given. – Lucero Apr 16 '09 at 15:19
  • 1
    Of course ... only an example and the Umlaute were easiest to type on a German keyboard. – Daniel Brückner Apr 16 '09 at 15:51
  • 4
    You didn't get the point I was trying to make. It's not about your choice of characters as sample, but about not really being able to whitelist all possible combinations. – Lucero Apr 16 '09 at 16:15
  • Why not? There aren't that many accented letters. If you have to manage a separate list for each language, so be it. – Armstrongest Sep 30 '09 at 16:57
  • 3
    @Atomiton, Vietnamese (for example) has 11 vowel nuclei, each of which can have one of 5 accents (ex: ệ) as well as the letter đ. Polish has Ł Ź Ś Ę... Turkish has the dotted I, İ. There are hundreds of different accented letters. – Jacob Krall Sep 30 '09 at 17:03
  • 2
    There are a few hundred he wants to include but there are several thousands he wants to exclude. – Daniel Brückner Sep 30 '09 at 18:11
  • If you want to continue matching even after a non-match character is found, you can use ([a-zA-Z0-9]+) – mnagdev Jan 03 '23 at 07:16
11

To exclude certain characters (<, >, %, and $), you can make a regular expression like this:

[<>%\$]

This regular expression will match any input that contains a blacklisted character, provided you search with find() rather than matches(). The brackets define a character class. The \ before the dollar sign is optional here: the dollar sign has a special meaning elsewhere in regular expressions, but inside a character class it is a literal.

To add more characters to the black list, just insert them between the brackets; order does not matter.

According to some Java documentation for regular expressions, you could use the expression like this:

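import java.util.regex.Matcher;
import java.util.regex.Pattern;

// '$' is literal inside a character class, so it needs no escaping.
Pattern p = Pattern.compile("[<>%$]");
Matcher m = p.matcher(unsafeInputString);
// find() looks for a match anywhere in the input;
// matches() would only succeed if the whole input were one blacklisted character.
if (m.find())
{
    // Invalid input: reject it, or remove/change the offending characters.
}
else
{
    // Valid input.
}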
Danielson
David Grayson
  • matches() returns true iff the regex matches the whole string, as if it were anchored at both ends with '^' and '$'; you would need to use find() for this approach to work. But see the other answers for why a blacklist is a bad idea. – Alan Moore Apr 16 '09 at 22:18
  • Also, most metacharacters lose their special meanings when they're in a character class, so there's no need to escape the '$'. But if you did need to escape it you would have to use two backslashes ("\\$") because it's in a Java String literal. – Alan Moore Apr 16 '09 at 22:22
  • How can I remove those characters from a string? The "replaceAll" method is removing valid characters from the String. – Sanshayan Nov 03 '18 at 11:27
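On the removal question in the last comment, a minimal sketch (assuming the same four blacklisted characters; the class and method names are hypothetical) is to pass the character class itself to replaceAll. If replaceAll strips valid characters, one thing worth checking is whether the pattern was accidentally negated or written outside the brackets.

```java
// Hypothetical helper; the names are illustrative only.
final class Sanitizer {
    // Removes every occurrence of < > % $ and leaves all other characters alone.
    static String stripBlacklisted(String input) {
        // Inside a character class '$' is a literal, so no backslash is needed.
        return input.replaceAll("[<>%$]", "");
    }
}
```

For example, stripBlacklisted("a<b>c") returns "abc", and accented input passes through unchanged.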
8

Even in 2009, it seems too many had a very limited idea of what designing for the WORLDWIDE web involved. In 2015, unless designing for a specific country, a blacklist is the only way to accommodate the vast number of characters that may be valid.

The characters to blacklist then need to be chosen according to what is illegal for the purpose for which the data is required.

However, sometimes it pays to break down the requirements and handle each separately. Here look-ahead is your friend. These are sections bounded by (?=) for positive, and (?!) for negative, and they effectively become AND blocks: a look-ahead consumes no text, so when a block is processed and has not failed, the regex processor starts the next block from the same position at the start of the text. In effect, each look-ahead block behaves as if preceded by the ^ and, if its pattern is greedy, can scan up to the $. Even the ancient VB6/VBA (Office) 5.5 regex engine supports look-ahead.

So, to build up a full regular expression, start with the look-ahead blocks, then add the blacklisted character block before the final $.

For example, to limit the total number of characters, say between 3 and 15 inclusive, start with the positive look-ahead block (?=^.{3,15}$). Note that this needed its own ^ and $ to ensure that it covered all the text.

Now, while you might want to allow _ and -, you may not want to start or end with them, so add the two negative look-ahead blocks, (?!^[_-].+) for starts, and (?!.+[_-]$) for ends.

If you don't want multiple _ and -, add a negative look-ahead block of (?!.*[_-]{2,}). This will also exclude _- and -_ sequences.

If there are no more look-ahead blocks, then add the blacklist block before the $, such as [^<>[\]{}|\\\/^~%# :;,$%?\0-\cZ]+, where the \0-\cZ excludes null and control characters, including NL (\n) and CR (\r). The final + ensures that all the text is greedily included.

Within the Unicode domain, there may well be other code-points or blocks that need to be excluded as well, but certainly a lot less than all the blocks that would have to be included in a whitelist.

The whole regex of all of the above would then be

(?=^.{3,15}$)(?!^[_-].+)(?!.+[_-]$)(?!.*[_-]{2,})[^<>[\]{}|\\\/^~%# :;,$%?\0-\cZ]+$

which you can check out live on https://regex101.com/, for the PCRE (PHP), JavaScript and Python regex engines. I don't know where the Java regex engine fits in those, but you may need to modify the regex to cater for its idiosyncrasies.

If you want to include spaces, but not _, just swap them everywhere in the regex.

The most useful application for this technique is for the pattern attribute for HTML input fields, where a single expression is required, returning a false for failure, thus making the field invalid, allowing input:invalid css to highlight it, and stopping the form being submitted.
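Since the question is about Java, here is a hedged sketch of how the combined expression might be adapted for java.util.regex. Two changes are my own assumptions about what that engine needs: the [ inside the character class is escaped (an unescaped [ starts a nested class in Java), and \0-\cZ is written as \x00-\x1A (Java requires octal digits after \0). The duplicate % is dropped and the class name is made up.

```java
import java.util.regex.Pattern;

// Hypothetical validator; the class and method names are illustrative only.
final class LookaheadValidator {
    private static final Pattern VALID = Pattern.compile(
        "(?=^.{3,15}$)"        // length between 3 and 15
        + "(?!^[_-].+)"        // must not start with _ or -
        + "(?!.+[_-]$)"        // must not end with _ or -
        + "(?!.*[_-]{2,})"     // no runs of _ or - (also rejects _- and -_)
        // Blacklist block: '[' escaped and control range spelled as \x00-\x1A for Java.
        + "[^<>\\[\\]{}|\\\\/^~%# :;,$?\\x00-\\x1A]+$");

    static boolean isValid(String input) {
        // matches() requires the whole input to satisfy the pattern.
        return VALID.matcher(input).matches();
    }
}
```

With this sketch, isValid("user_name") is accepted, while "_abc", "abc-", "a__b" and "a<b>c" are rejected.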

Patanjali
  • When providing answers that include regex, be aware that some characters, like _ and *, may disappear in the final render of the text of your answer. In that case, precede them with a \. Sometimes, only the first occurrence of the character may need the \ to ensure all of that character show up in the regex. It's not consistent, so watch the rendered text as you type in, and add \ as required. – Patanjali Nov 28 '15 at 00:16
  • @Mariano. You have obviously edited my answer to highlight the regex, but you obviously DIDN'T read my comment above that some \ were needed to be inserted so the character after each was visible. Your editing has LEFT IN the now unnecessary \s. I will now edit them out. If you are going to mess with answers, do the full edit, or leave them alone. – Patanjali Nov 29 '15 at 09:20
  • @Mariano. You left in four \, which I have now eliminated. You were correct in that the second look-ahead was incorrect. My bad, as I had worked it out for a starting _-, but then remembered about trailing ones, and did an ad-hoc edit without testing. You were also correct about the trailing `.*' in the then third look-ahead. Also, about the starting ^. Hat-trick! – Patanjali Nov 29 '15 at 09:50
  • @Mariano. Where did you go? That's what I call guerrilla editing! – Patanjali Nov 29 '15 at 09:54
  • I'm glad it helped as to provide a better answer overall. If you don't agree with an edit, you have all right to rollback. I removed the previous comments as they are now obsolete. – Mariano Nov 29 '15 at 09:59
  • @Mariano (comments disappeared now), asked how this regex was supposed to help blacklists. I wanted to show how to cover several issues related to excluding characters, as some characters may be legal, except for certain circumstances, such as for the _ and - not being allowed to start, end or be in multiples, but otherwise are OK. This would be impossible without the AND functionality look-ahead provides. However, I sometimes wonder if we would have been better off with ABNF expressions instead of regex, as they are more flexible and expressive, and don't get so obtuse when they get large. – Patanjali Nov 29 '15 at 10:05
  • Good to see a working version now. The alternative to the `AND`s would be negating the result of a match with the regex `^[-_]|[-_](?:[-_]|$)|[<>[\]{}|\\\/^~%# :;,$%?\0-\cZ]`. As for the length check, it could be done independently. Also, if you're interested, [Parse2](http://www.parse2.com/) generates parse trees from ABNF. – Mariano Nov 29 '15 at 10:29
  • 1
    @Mariano. De Morgan's Theorem in action! However, not always able to choose the negative externally, especially if consistency of the output logic is required. Being able to get a positive match, including length, in one expression means it can be used for the pattern attribute in HTML input text fields, which is where I suspect this will be used a lot. – Patanjali Nov 29 '15 at 22:02
  • This part was useful for me: "These are sections bounded by (?=) for positive, and (?!) for negative" – E A May 01 '17 at 13:09
6

The negated set of everything that is not alphanumeric & underscore for ASCII chars:

/[^\W]/g

For email or username validation I've used the following expression, which allows 4 standard special characters: - _ . @

/^[-.@_a-z0-9]+$/gi

For a strict alphanumeric only expression use:

/^[a-z0-9]+$/gi

Test @ RegExr.com

MCGRAW
  • 1
    The OP's requirement was to be able to include other languages. `\w` and `\W` only deal with ASCII. Also, `-`, unless used in a range, must be last in a `[]` term. – Patanjali Jan 25 '18 at 16:08
  • @patanjali /^[-.@_a-z0-9]+$/gi this works, no doubt about it. – MCGRAW Jan 26 '18 at 23:45
  • Learn something new each day: `-` can be at the start or end of a `[]` expression, which makes sense. However, `/[^\W]/` couldn't handle `á`. To really handle multilingual text, have to use atomic expressions like `\p{Ll}\p{M}*` to match a letter, and `\p{N}` to match a number. – Patanjali Jan 27 '18 at 16:31
  • To my comment above, it should be `\p{L}\p{M}*` to handle any case letter. The given one was for lower case. – Patanjali Mar 28 '18 at 14:26
6

I guess it depends what language you are targeting. In general, something like this should work:

[^<>%$]

The "[]" construct defines a character class, which will match any of the listed characters. Putting "^" as the first character negates the match, i.e. any character OTHER than one of those listed.

You may need to escape some of the characters within the "[]", depending on what language/regex engine you are using.

KarstenF
5

It's usually better to whitelist the characters you allow than to blacklist the characters you don't allow, both from a security standpoint and from an ease-of-implementation standpoint.

If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.

http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07

If you want to whitelist all the accent characters, perhaps using unicode ranges would help? Check out this link.

http://www.regular-expressions.info/unicode.html

Jason Coyne
  • Thanks for your reply. We tried whitelisting first but it is not practical since we want to allow any accented characters. We started with this: ^[a-zA-Z0-9. '-]+$ then we had to add all the French characters manually. Now we need all the German ones and so on. –  Apr 16 '09 at 15:15
  • Have a look on my pattern, it whitelists all characters including all accented ones. – Lucero Apr 16 '09 at 15:20
  • According to Gaijin's link, Lucero's pattern is too simplistic; check out the section labeled "Unicode Character Properties". (You need something like "\p{L}\p{M}*" to really catch all accented characters.) But I'm quite certain a whitelist is the way to go; a fully-populated blacklist will hurt. – BlairHippo Apr 16 '09 at 15:56
2

Do you really want to blacklist specific characters, or rather whitelist the allowed characters?

I assume that you actually want the latter. This is pretty simple (add any additional symbols to whitelist into the [\-] group):

^(?:\p{L}\p{M}*|[\-])*$

Edit: Optimized the pattern with the input from the comments
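A sketch of this pattern in Java, under the assumption that the extra allowed symbols are the digits, dot, space, apostrophe and hyphen from the asker's earlier whitelist (^[a-zA-Z0-9. '-]+$). The class name is hypothetical, and I use + instead of * so that empty input is rejected.

```java
import java.util.regex.Pattern;

// Hypothetical validator; the class and method names are illustrative only.
final class UnicodeWhitelist {
    // \p{L}\p{M}* matches a letter in any script plus any combining marks
    // that follow it, so both precomposed and decomposed accents pass.
    private static final Pattern ALLOWED =
        Pattern.compile("(?:\\p{L}\\p{M}*|[0-9. '-])+");

    static boolean isAllowed(String input) {
        // matches() anchors the pattern to the whole input.
        return ALLOWED.matcher(input).matches();
    }
}
```

This accepts names like "Zoë Müller-O'Brien" without listing any accented letter explicitly, and still rejects "a<b".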

Lucero
  • This is the right idea, but I don't think the capture group is needed, or in the right place. Wouldn't "[-\p{L}]*", used with the `matches()` method, do just fine? – erickson Apr 16 '09 at 15:34
  • Yes it should. However, I wasn't sure how the Java Regex engine handles [-\p{L}] exactly; I'd at least escape the - character. Or you can make a non-capturing group (which makes the regex a little less easy to read): ^(?:\p{L}|[\-])*$ – Lucero Apr 16 '09 at 15:51
  • See the second of Gaijin's two links, under "Unicode Character Properties" -- this might not catch everything it needs to, depending on how the character is encoded. (That page suggests "\p{L}\p{M}*".) But it definitely feels like it's close to being the solution. – BlairHippo Apr 16 '09 at 16:06
  • This depends mainly whether the string is normalized or not, but yes, this is a valid point. – Lucero Apr 16 '09 at 16:11
1

Here are all the French accented characters: àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ

I would google a list of German accented characters. There aren't THAT many. You should be able to get them all.

For URLs, I replace accented letters with regular letters like so:

string beforeConversion = "àÀâÂäÄáÁéÉèÈêÊëËìÌîÎïÏòÒôÔöÖùÙûÛüÜçÇ’ñ";
string afterConversion = "aAaAaAaAeEeEeEeEiIiIiIoOoOoOuUuUuUcC'n";
// 'cleaned' is assumed to already hold the URL text being sanitized.
for (int i = 0; i < beforeConversion.Length; i++) {
    // Swap each accented character for the plain letter at the same index.
    cleaned = Regex.Replace(cleaned, beforeConversion[i].ToString(), afterConversion[i].ToString());
}

There's probably a more efficient way, mind you.
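One possibly simpler route in Java, which is a different technique from the lookup table above, is to decompose the text with java.text.Normalizer and then drop the combining marks. This is only a sketch with a made-up class name. It handles letters that have a canonical decomposition, so characters such as the typographic apostrophe ’ are left as they are.

```java
import java.text.Normalizer;

// Hypothetical helper; the class and method names are illustrative only.
final class AccentStripper {
    static String stripAccents(String input) {
        // NFD splits a precomposed letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then removed.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```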

Armstrongest
  • 1
    Note that the OP only used French and German as examples, not as an exhaustive list, without indicating how big the list was. Many have assumed that they were mistaken in ASKING for a blacklist. – Patanjali Nov 28 '15 at 02:45
1

Why do you consider regex the best tool for this? If your purpose is to detect whether an illegal character is present in a string, testing each character in a loop will be both simpler and more efficient than constructing a regex.
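A minimal sketch of that loop in Java, assuming the same four blacklisted characters used elsewhere in this thread (the class and method names are hypothetical):

```java
// Hypothetical helper; the names are illustrative only.
final class CharScan {
    // Characters that are not allowed in the input.
    private static final String BLACKLIST = "<>%$";

    // Returns true as soon as any blacklisted character is found.
    static boolean containsBlacklisted(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (BLACKLIST.indexOf(input.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```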

DJClayworth
  • The pattern attribute for HTML input fields is designed to take a regex, so why write a program to do the same thing? – Patanjali Nov 29 '15 at 22:29
0

Use this one:

^(?=[a-zA-Z0-9~@#$^*()_+=[\]{}|\\,.?: -]*$)(?!.*[<>'"/;`%])
TylerH
0

I strongly suspect it's going to be easier to come up with a list of the characters that ARE allowed vs. the ones that aren't -- and once you have that list, the regex syntax becomes quite straightforward. So put me down as another vote for "whitelist".

BlairHippo