How can I include Latin characters like ČčĆ抚Đđ in this JavaScript regexp?

var regex = new RegExp('\\b' + this.value, "i");

UPDATE:

I have this code for filtering checkbox labels, but it doesn't work well when the input contains Č, č, or ć.

function listFilter(list, input) {
    var $lbs = list.find('.css-label');

    function filter(){
        var regex = new RegExp('\\b' + this.value);
        var $els = $lbs.filter(function(){
            return regex.test($(this).text());
        });
        $lbs.not($els).hide().prev().hide();
        $els.show().prev().show();
    };

    input.keyup(filter).change(filter)
}

jQuery(function($){
    listFilter($('#list'), $('.search-filter'))
})

Here is a fiddle: DEMO

tchrist
user2406735
  • Can you give us an example of runnable code that doesn't work as expected? – loganfsmyth Jul 30 '13 at 14:29
  • Is this a duplicate? http://stackoverflow.com/questions/7258375/latin-charcters-included-in-javascript-regex Also check the link from one of the comments in that question: http://stackoverflow.com/questions/280712/javascript-unicode – Gray Jul 30 '13 at 14:29
  • @loganfsmyth Probably something like `new RegExp('\\b' + 'ČčĆ抚Đđ', "i").test('ČčĆ抚Đđ')` which returns false. – Denys Séguret Jul 30 '13 at 14:29
  • Not Latin. Those chars look more like Serbian or Croatian than Latin. – Spudley Jul 30 '13 at 16:14
  • @Gray Yes, probably. I don’t know of any partial band-aid for JavaScript’s horrendous Unicode non-handling that doesn’t involve [XRegExp](http://xregexp.com/plugins/#unicode), and even that is a far, far cry from the most basic, level-1 compliance with **the published gold standard** for this sort of thing, [***UTS 18: Unicode Regular Expressions***](http://www.unicode.org/reports/tr18/). Because JavaScript is so bad at Unicode, you are forced to do all Unicode work at the backend in some other, more capable programming language, but that isn’t always possible. – tchrist Jul 30 '13 at 16:20
  • @Spudley You are mistaken. Those are indeed characters from the Latin script. In particular, they are the characters LATIN CAPITAL LETTER C WITH CARON, LATIN SMALL LETTER C WITH CARON, LATIN CAPITAL LETTER C WITH ACUTE, LATIN SMALL LETTER C WITH ACUTE, LATIN CAPITAL LETTER S WITH CARON, LATIN SMALL LETTER S WITH CARON, LATIN CAPITAL LETTER D WITH STROKE, LATIN SMALL LETTER D WITH STROKE, and in that order — provided that this is in Normalization Form C, of course. – tchrist Jul 30 '13 at 16:21

2 Answers

The problem with your regexp is that the word boundary isn't detected properly around those characters. \b is defined in terms of \w and \W, and in JavaScript \w matches only [A-Za-z0-9_], so Č, ć and the like are treated as non-word characters.
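
To make that concrete, here is a small sketch of how \b behaves around an accented letter. The sample strings are made up for illustration and are not taken from the question's fiddle.

```javascript
// \b matches only between a \w character and a non-\w character.
// Without the u flag, \w is [A-Za-z0-9_], so "č" is a non-word character.
var asciiWord = /\bcaj/i.test('zeleni caj');     // boundary between " " and "c"
var accentedWord = /\bčaj/i.test('zeleni čaj');  // no boundary: " " and "č" are both non-word
var midWord = /\baj/i.test('zeleni čaj');        // boundary between "č" and "a"

console.log(asciiWord, accentedWord, midWord);   // true false true
```

The last case is the surprising one: the regex sees a word start in the middle of "čaj".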

I'd suggest starting with

new RegExp('(^|[\\s\\.])ČčĆ抚Đđ', "i")

and adding to [\\s\\.] any other characters you need to treat as word boundaries.

If you can't enumerate the possible word boundaries, you're better off using a library that produces "Unicode compatible" regular expressions. Some are listed in this related question.
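
Applied to a dynamic search value like the one in the question, this idea might look roughly like the sketch below. The helper names `escapeRegExp` and `startsWordRegex` and the sample labels are hypothetical, and the delimiter set is just the [\s.] from above, so treat it as a starting point rather than a complete solution.

```javascript
// Hypothetical helper: escape regex metacharacters in user-typed text
// so that input such as "c++" does not break the pattern.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a case-insensitive "starts a word" matcher without relying on \b.
// Extend the [\s.] delimiter set with whatever else separates words for you.
function startsWordRegex(value) {
    return new RegExp('(^|[\\s.])' + escapeRegExp(value), 'i');
}

// Made-up labels standing in for the checkbox labels.
var labels = ['Čokolada', 'Bijela čokolada', 'Šećer', 'Kruh'];
var regex = startsWordRegex('čok');
var matches = labels.filter(function (label) {
    return regex.test(label);
});

console.log(matches); // [ 'Čokolada', 'Bijela čokolada' ]
```

In engines with ES2018 support, a lookbehind such as `(?<![\p{L}\p{N}_])` combined with the `u` flag may be a closer stand-in for \b, though that goes beyond what this answer proposes.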

Denys Séguret
  • You need to escape the backslashes if you build it this way, but the literal is shorter, so I'd rather use that: `/(^|[\s\.])ČčĆ抚Đđ/i`. –  Jul 30 '13 at 16:11
  • @wvxvw thanks (and +1). I had forgotten the string escaping. I supposed OP had a dynamically provided string to include in the regex, that's why I didn't use a literal. – Denys Séguret Jul 30 '13 at 16:13
  • @user2406735 I’m not sure it’s true that “most” regex engines have poor Unicode support, but JavaScript definitely earns the *Worst In Show* award there. Go, Java, Perl, Python, and Ruby all do a better job in that regard, and in some cases a much better one. For JavaScript, you should install [the XRegExp Unicode plug-in](http://xregexp.com/plugins/#unicode), which will help. – tchrist Jul 30 '13 at 16:16
  • @tchrist I'm not sure about "most" (it would be painful anyway to list all engines, especially once you start counting the editor ones) and I agree JS is especially bad at this, so I've removed *"like most engines"*. – Denys Séguret Jul 30 '13 at 16:21
  • You’re right that it is a big topic to start any sort of enumeration of different programming languages and tools like editors. There’s a summary of Unicode character property support [here](http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines) in Part 2’s last column. The real story is a lot more complex than that, but it’s a good starting-off point. – tchrist Jul 30 '13 at 16:28

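Try:

/^[A-Za-z\u00C0-\u017F\s'.,\/#!$%^&*;:{}=\-_`~()]+$/

as the regular expression. The range \u00C0-\u017F covers the accented letters of Latin-1 Supplement plus Latin Extended-A, which is where Č, ć, Š and đ live. Use A-Za-z rather than A-z, since A-z also matches [, \, ], ^, _ and `.

See the examples below:

// No g flag: test() on a global regex keeps state in lastIndex between calls.
var regexp = /[A-Za-z\u00C0-\u017F]+/,
  ascii = ' hello !@#$%^&*())_+=',
  latin = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏàáâãäåæçèéêëìíîïÐÑÒÓÔÕÖØÙÚÛÜÝÞßðñòóôõöøùúûüýþÿ',
  croatian = 'ČčĆ抚Đđ',
  chinese = ' 你 好 ';

console.log(regexp.test(ascii)); // true ("hello" matches)
console.log(regexp.test(latin)); // true
console.log(regexp.test(croatian)); // true
console.log(regexp.test(chinese)); // false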

Gist: https://gist.github.com/germanattanasio/84cd25395688b7935182

German Attanasio