How to ban words with diacritics using a blacklist array and regex?

Question

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against words with diacritics from the array:

var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new RegExp('\\b' + bannedWords.join("\\b|\\b") + '\\b', 'i');

$(function () {
  $("input").on("change", function () {
    var valid = !regex.test(this.value);
    alert(valid);
  });
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check'>

For example, for the word băţ it now returns true instead of false.

Possible duplicate of [utf-8 word boundary regex in javascript](http://stackoverflow.com/questions/2881445/utf-8-word-boundary-regex-in-javascript) — Chiu, Aug 25 '16 at 08:52
That link does not help me. Or at least I don't understand how it is helping me. Can you explain why you think that my question is a duplicate of that? — Ionut Necula, Aug 25 '16 at 12:05
Instead of using the word boundary `\b`, try using what the referring answer suggested. ă and ţ are not ASCII characters. That's why `\b` fails. This is where the utf-8 word boundary steps in. — Chiu, Aug 25 '16 at 12:54
Simply put, diacritics means utf-8. That's why I flagged your question as a duplicate. Hope it helps. — Chiu, Aug 25 '16 at 13:57
I'm not sure of what the problem is. If you have a _list_ of banned words, put them into a single regex with alternations. Then check that. Why go through all this hassle? If you have a large list, make a regex trie out of a ternary tree. Grab this app (**[screenshot](http://www.regexformat.com/version_files/Rx5_ScrnSht01.jpg)**) to make it for you. And you shouldn't be using a word boundary anyway, you should use a whitespace boundary. `(?<!\S)(?:stuff|or|stuff)(?!\S)` — , Aug 31 '16 at 18:09

myf · Accepted Answer · 2016-08-30T09:25:02.867

Chiu's comment is right: 'aaáaa'.match(/\b.+?\b/g) yields the rather counter-intuitive [ "aa", "á", "aa" ], because a "word character" (\w) in JavaScript regular expressions is just shorthand for [A-Za-z0-9_] ('case-insensitive-alpha-numeric-and-underscore'), so a word boundary (\b) matches any position between a run of those characters and any other character. This makes extracting "Unicode words" quite hard.
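The effect can be reproduced without any DOM code. A minimal sketch, assuming the pattern is built the same way as in the question (the variable names here are illustrative only):

```javascript
// \w is ASCII-only, so \b treats "ă" and "ţ" as non-word characters.
var asciiRegex = new RegExp('\\b' + ['bad', 'băţ'].join('\\b|\\b') + '\\b', 'i');

var plainHit = asciiRegex.test('bad');     // ASCII word: boundaries behave as expected
var diacriticHit = asciiRegex.test('băţ'); // no boundary after "ţ" at the end of the input
var gluedHit = asciiRegex.test('băţx');    // a boundary appears between "ţ" and "x"

console.log(plainHit, diacriticHit, gluedHit); // true false true
```

So the banned word slips through on its own, yet is reported when an ASCII letter happens to follow it.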

For non-unicase writing systems it is possible to identify "word character" by its dual nature: ch.toUpperCase() != ch.toLowerCase(), so your altered snippet could look like this:

var bannedWords = ["bad", "mad", "testing", "băţ", "bať"];
var bannedWordsRegex = new RegExp('-' + bannedWords.join("-|-") + '-', 'i');

$(function() {
  $("input").on("input", function() {
    var invalid = bannedWordsRegex.test(dashPaddedWords(this.value));
    $('#log').html(invalid ? 'bad' : 'good');
  });
  $("input").trigger("input").focus();

  function dashPaddedWords(str) {
    return '-' + str.replace(/./g, wordCharOrDash) + '-';
  };

  function wordCharOrDash(ch) {
    return isWordChar(ch) ? ch : '-'
  };

  function isWordChar(ch) {
    return ch.toUpperCase() != ch.toLowerCase();
  };
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input type='text' name='word_to_check' value="ba">
<p id="log"></p>
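For checking the idea outside the browser, the same dash-padding approach can be sketched as plain functions. The wrapper containsBannedWord is a hypothetical name, and the replace pattern is widened to [\s\S] on the assumption that line breaks should also act as separators:

```javascript
// A letter is taken to be any character whose upper and lower case differ.
function isWordChar(ch) {
  return ch.toUpperCase() !== ch.toLowerCase();
}

// Replace every non-letter with "-" and pad both ends, so each word
// ends up delimited by dashes: "a băţ!" becomes "-a-băţ--".
function dashPaddedWords(str) {
  return '-' + str.replace(/[\s\S]/g, function (ch) {
    return isWordChar(ch) ? ch : '-';
  }) + '-';
}

// Hypothetical wrapper, not part of the snippet above.
function containsBannedWord(text, bannedWords) {
  var regex = new RegExp('-(?:' + bannedWords.join('|') + ')-', 'i');
  return regex.test(dashPaddedWords(text));
}

var banned = ['bad', 'mad', 'testing', 'băţ', 'bať'];
console.log(containsBannedWord('a băţ!', banned)); // true
console.log(containsBannedWord('băţy', banned));   // false
console.log(containsBannedWord('bad1', banned));   // true
```

Digits and underscores have no case, so this test treats them as separators; that is why bad1 is reported.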

Your "just shorthand for" link displays backslashes but should not. I assume this is a markdown error, but I'll let you correct it yourself. — Adam Katz, Aug 29 '16 at 21:09
`[\b]` and `[^\b]` don't work that way. Because these are inside character classes, they are interpreted as "is a backspace character" and "is not a backspace character" ([read more here](http://www.regular-expressions.info/refcharclass.html?selflavor=javascript "(select “JavaScript” as one of the languages)")). The opposite of `\b` (zero-width word boundary) is `\B` (zero-width non-word boundary), which (in JavaScript) uses the same `[A-Za-z_0-9]` definition of "word characters" and is therefore unhelpful here. — Adam Katz, Aug 29 '16 at 21:17
Thanks for remarks: corrected link formatting. And thanks for that `/^[\b]$/.test('\u0008') === true` quip, I admit I didn't know that. But it was not that relevant, for I just wanted to demonstrate that "\b works just with ASCII" thing like you did in your answer. — myf, Aug 30 '16 at 09:46

Adam Katz · Answer 2 · 2016-09-01T16:04:51.127

Let's see what's going on:

alert("băţ".match(/\w\b/));

This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.

I think the best you can do is something like this:

var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
                       + bannedWords.join('|')
                       + ')(?=' + bound + '|$)', 'i');

where bound is a reversed list of all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The second of which is a zero-width lookahead that better mimics \b and therefore works well with the g regex flag.)

Given ["bad", "mad", "testing", "băţ"], this becomes:

/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i

This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).

You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
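Both variants of bound can be compared side by side. This is only a sketch: buildBannedRegex is a hypothetical helper that wraps the construction shown above, and the sample inputs are made up.

```javascript
var bannedWords = ['bad', 'mad', 'testing', 'băţ'];

// Hypothetical helper: wrap the word list in the boundary approximation,
// where `bound` is a character class matching "non-word" characters.
function buildBannedRegex(words, bound) {
  return new RegExp('(?:^|' + bound + ')(?:' + words.join('|') + ')(?=' + bound + '|$)', 'i');
}

var latinRegex = buildBannedRegex(bannedWords, '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]');
var asciiRegex = buildBannedRegex(bannedWords, '[\\s!-/:-@[-`{-~]');

// '\u2026' is a horizontal ellipsis: a non-ASCII character that is not a letter.
['băţ', 'a băţ!', 'băţy', 'băţă', 'badge', '(mad)', 'băţ\u2026'].forEach(function (input) {
  console.log(input, latinRegex.test(input), asciiRegex.test(input));
});
```

The two agree on every input except the last: the ASCII-only list does not count the ellipsis as a separator, so that variant lets băţ… through.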

SamWhan · Answer 3 · 2016-09-02T10:04:44.923

Instead of using a word boundary, you could do it with

(?:[^\w\u0080-\u02af]+|^)

to check for start of word, and

(?=[^\w\u0080-\u02af]|$)

to check for the end of it.

The [^\w\u0080-\u02af] matches any character that is not (^) a basic Latin word character (\w) or part of the Unicode Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. This range includes some punctuation, but the class would get very long if it matched just letters. It may also have to be extended if other character sets have to be included. See for example Wikipedia.

Since JavaScript (at the time of writing) doesn't support lookbehinds, the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.

Also, putting these tests outside a non-capturing group that alternates the words makes it significantly more efficient.

var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
    regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');

function myFunction() {
    document.getElementById('result').innerHTML = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
}

<!DOCTYPE html>
<html>
<body>

Enter word: <input type='text' id='word_to_check'>
<button onclick='myFunction()'>Test</button>

<p id='result'></p>

</body>
</html>
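The same pattern can be exercised without the HTML page. A small sketch with made-up sample inputs, which also shows that the start-of-word test consumes the separator in front of the word:

```javascript
var bannedWords = ['bad', 'mad', 'testing', 'băţ', 'båt', 'süß'];
var regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join('|') + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');

console.log(regex.test('bad!!1!')); // true: "!" ends the word
console.log(regex.test('băţăţ'));   // false: "ă" continues the word
console.log(regex.test('_bad_'));   // false: "_" is part of \w

// The leading separator is part of the match, the trailing one is not.
var matched = 'so bad!'.match(regex)[0];
console.log(JSON.stringify(matched)); // " bad"
```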

You forgot to escape backslashes in string literals. Also this will let pass values like `bad!!1!` which I assume should be blocked. — myf, Sep 01 '16 at 14:50
Thanks @myf for pointing that out. I believe it's fixed now :) — SamWhan, Sep 02 '16 at 07:59
better :] although now it bans values such as `băţăţ`, which I assume should be permitted. — myf, Sep 02 '16 at 08:35
Yup, even better. Just look after those `_bad_` underscores that leaked along with `\w` :] — myf, Sep 02 '16 at 20:57

score 2 · Answer 4 · answered Sep 01 '16 at 20:24

You need a Unicode-aware word boundary. The easiest way is to use the XRegExp package.

Although its \b is still ASCII-based, it has a \p{L} construct (or the shorter \pL form) that matches any Unicode letter from the BMP. Building a custom word boundary with this construct is easy:

\b                     word            \b
  ---------------------------------------
 |                       |               |
([^\pL0-9_]|^)         word       (?=[^\pL0-9_]|$)

The leading word boundary can be represented with a (non-)capturing group ([^\pL0-9_]|^) that matches (and consumes) either the start of the string or a character that is not a Unicode letter from the BMP, a digit, or _ before the word.

The trailing word boundary can be represented with a positive lookahead (?=[^\pL0-9_]|$) that requires either the end of the string or a character that is not a Unicode letter from the BMP, a digit, or _ after the word.

See the snippet below that will detect băţ as a banned word, and băţy as an allowed word.

var bannedWords = ["bad", "mad", "testing", "băţ"];
var regex = new XRegExp('(?:^|[^\\pL0-9_])(?:' + bannedWords.join("|") + ')(?=$|[^\\pL0-9_])', 'i');

$(function () {
  $("input").on("change", function () {
    var valid = !regex.test(this.value);
    //alert(valid);
    console.log("The word is", valid ? "allowed" : "banned");
  });
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
<input type='text' name='word_to_check'>
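Where the engine supports ES2018 Unicode property escapes, the same boundary can be written without an extra library. Note that this swaps XRegExp for the native \p{L} syntax with the u flag; it is a sketch under that assumption, not the XRegExp API, and nativeRegex is an illustrative name:

```javascript
var bannedWords = ['bad', 'mad', 'testing', 'băţ'];

// The "u" flag enables \p{L}; "i" keeps the match case-insensitive.
var nativeRegex = new RegExp(
  '(?:^|[^\\p{L}0-9_])(?:' + bannedWords.join('|') + ')(?=$|[^\\p{L}0-9_])',
  'iu'
);

['băţ', 'un băţ.', 'băţy', 'băţă'].forEach(function (input) {
  console.log(input, nativeRegex.test(input) ? 'banned' : 'allowed');
});
```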

TolMera · Answer 5 · 2016-09-01T09:36:14.797

When dealing with characters outside my base set (which can show up at any time), I convert them to an appropriate base equivalent (8-bit, 16-bit, 32-bit) before running any character matching over them.

var bannedWords = ["bad", "mad", "testing", "băţ"];
var bannedWordsBits = {};
bannedWords.forEach(function(word){
  bannedWordsBits[word] = "";
  for (var i = 0; i < word.length; i++){
    bannedWordsBits[word] += word.charCodeAt(i).toString(16) + "-";
  }
});
var bannedWordsJoin = []
var keys = Object.keys(bannedWordsBits);
keys.forEach(function(key){
  bannedWordsJoin.push(bannedWordsBits[key]);
});
var regex = new RegExp(bannedWordsJoin.join("|"), 'i');

function checkword(word) {
  var wordBits = "";
  for (var i = 0; i < word.length; i++){
    wordBits += word.charCodeAt(i).toString(16) + "-";
  }
  return !regex.test(wordBits);
};

The separator "-" is there to make sure that unique characters don't bleed together creating undesired matches.
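To see what the encoded strings look like, here is a small sketch using the same encoding; toHexCodes is a hypothetical name for the loop used above. It also illustrates that the encoding by itself adds no word boundary, and that a trailing separator alone still lets a longer code unit end in the same digits:

```javascript
// Same encoding as in the snippet above: hex code unit plus a trailing "-".
function toHexCodes(text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    out += text.charCodeAt(i).toString(16) + '-';
  }
  return out;
}

var encodedBanned = toHexCodes('bad');  // "62-61-64-"
var encodedInput = toHexCodes('abad');  // "61-62-61-64-"
var glued = toHexCodes('\u0162ad');     // "162-61-64-" (U+0162 followed by "ad")

console.log(encodedInput.indexOf(encodedBanned) !== -1); // true: matched inside a longer word
console.log(glued.indexOf(encodedBanned) !== -1);        // true: "162-" ends with "62-"
```

One possible remedy, which is an assumption and not part of the answer, would be to emit a separator before each code as well as after it.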

This is very useful, as it brings all the characters down to a common base that everything can interact with. It can also be re-encoded back to its original form without having to ship it as a key/value pair.

For me the best thing about it is that I don't have to know all of the rules for all of the character sets that I might intersect with, because I can pull them all into a common playing field.

As a side note:

To speed things up, rather than running the large regex that you probably have, which takes exponentially longer to process as the words you're banning get longer, I would pass each separate word in the sentence through the filter, and break the filter up into length-based segments, like:

checkword3Chars();
checkword4Chars();
checkword5Chars();

whose functions you can generate systematically, and even create on the fly as and when they become required.
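One possible reading of that idea, sketched with plain lookups rather than generated functions. Both helper names are hypothetical, and the sentence is split on whitespace only, so punctuation handling is left out:

```javascript
// Hypothetical helper: group banned words by length so each input word
// is only compared against candidates of the same length.
function buildLengthBuckets(words) {
  var buckets = {};
  words.forEach(function (word) {
    var bucket = buckets[word.length] || (buckets[word.length] = Object.create(null));
    bucket[word.toLowerCase()] = true;
  });
  return buckets;
}

// Returns true when no whitespace-separated word is in the banned list.
function sentenceIsClean(sentence, buckets) {
  return sentence.split(/\s+/).every(function (word) {
    var bucket = buckets[word.length];
    return !(bucket && bucket[word.toLowerCase()]);
  });
}

var buckets = buildLengthBuckets(['bad', 'mad', 'testing', 'băţ']);
console.log(sentenceIsClean('un băţ mare', buckets));  // false
console.log(sentenceIsClean('băţy is fine', buckets)); // true
```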
