12

I would like to use the regular expression new RegExp("\\b" + pat + "\\b") on Greek text, but the \b metacharacter only recognizes ASCII word characters.

I tried the XRegExp library, but I didn't manage to solve the issue.

Any suggestions would be greatly appreciated.

slevithan
kylito
  • 4
    possible duplicate of [utf-8 word boundary regex in javascript](http://stackoverflow.com/questions/2881445/utf-8-word-boundary-regex-in-javascript) – mplungjan Apr 13 '11 at 13:40
  • Did you use the [Unicode plugin](http://stevenlevithan.com/regex/xregexp/xregexp-unicode.js) to XRegExp? – R. Martinho Fernandes Apr 13 '11 at 13:40
  • 2
    [Javascript does not support Unicode](http://stackoverflow.com/questions/5562835/split-and-replace-unicode-words-in-javascript-with-regex/5596262#5596262), even though this is **the** dominant character set on the web. Use a language that does, and preferably one that meets at *least* the [Level 1 requirements for basic Unicode regular expression support](http://unicode.org/reports/tr18/#Basic_Unicode_Support). – tchrist Apr 13 '11 at 13:58
  • @tchrist: Right. So what language do you suggest using instead for browser scripting? – Tim Down Apr 13 '11 at 14:12
  • 1
    @Martinho, as I explain in my answer, the XRegExp plug in does not correct `\b` to work according to the [requirements of The Unicode Standard](http://unicode.org/reports/tr18/#Basic_Unicode_Support). It cannot be correctly implemented using only Unicode general categories, and even its approximation is mind-bending: `(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))`. You would have to replace `\w` with `[\pL\pM\p{Nd}\p{Nd}\p{Pc}]` wherever it occurs there, and you couldn’t — because Javascript cannot manage to do standard lookbehinds. So that plugin cannot solve this problem. – tchrist Apr 13 '11 at 14:41
  • 1
    @Tim: Because the ECMA standard — and almost all implementations — have dragged their feet for so long that they’re easily more than a decade out of date, I can think of no alternative to offloading more of the heavy-lifting to server-side back-end processing. The ICU regex library and Perl are both Level-1(plus) compliant with the Unicode Standard, so either will work fine with Unicode. Also, PHP, Ruby 1.9, and Python (and in that order) all go a substantial distance further than Javascript does towards compliance, and would at least allow for what the OP desires. Sorry there’s no good news. – tchrist Apr 13 '11 at 14:46
  • @tchrist: It would be possible to build a regular expression library in JavaScript that meets the Level 1 Unicode standard. Have you suggested it to XRegExp's author? – Tim Down Apr 13 '11 at 15:08
  • @Tim: No, I have not. I think that would be a very very good idea, though. Perhaps you might please do so? – tchrist Apr 13 '11 at 15:17
  • If you're ready to use a capturing group for your actual regexp, you could try something like (^|[^a-zA-Z0-9_])(yourpattern)(?=[^a-zA-Z0-9_]|$). The second group will be the result of your match. – Raze May 01 '11 at 17:36
  • Use alpha for a and omega (or upsilon with dialytika, or whatever) for z. There might be more to add to the range for Greek. – Raze May 01 '11 at 17:42
  • [This answer](https://stackoverflow.com/a/47963750/6440904) helped me with resolving match like `\b${word}\b`. – Flynn Hou May 25 '19 at 15:23
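
The capturing-group workaround suggested in the comments above could be sketched roughly as follows. This is only an illustration: the helper name `findWholeWord` and the constant `WORD` are hypothetical, and the choice of the Greek and Coptic (U+0370 to U+03FF) and Greek Extended (U+1F00 to U+1FFF) blocks as "word characters" is an assumption. Those blocks also contain a few punctuation marks, so the range is approximate.

```javascript
// Assumption: word characters are ASCII [A-Za-z0-9_] plus the Greek and Coptic
// and Greek Extended blocks. Adjust the ranges to suit the text at hand.
const WORD = "a-zA-Z0-9_\\u0370-\\u03FF\\u1F00-\\u1FFF";

// Hypothetical helper: returns the text matched by `pat` when it is delimited
// by non-word characters (or the string edges), otherwise null.
// `pat` is inserted as a regex fragment, exactly as in the question.
function findWholeWord(pat, text) {
  const re = new RegExp(
    "(^|[^" + WORD + "])(" + pat + ")(?=[^" + WORD + "]|$)"
  );
  const m = re.exec(text);
  return m ? m[2] : null; // group 2 holds the actual match
}

console.log(findWholeWord("και", "εγώ και εσύ"));
console.log(findWholeWord("και", "δικαιοσύνη"));
```

Because the leading delimiter is consumed by group 1, the match of interest has to be read from group 2, as the comment notes.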

2 Answers

4

I think this may help answer your question:

<script src="xregexp.js"></script>
<script src="xregexp-unicode-base.js"></script>
<script>
    var unicodeWord = XRegExp("^\\p{L}+$");

    unicodeWord.test("Русский"); // true
    unicodeWord.test("日本語"); // true
    unicodeWord.test("العربية"); // true
</script>

<!-- \p{L} is included in the base script, but other categories, scripts,
and blocks require token packages -->
<script src="xregexp-unicode-scripts.js"></script>
<script>
    XRegExp("^\\p{Katakana}+$").test("カタカナ"); // true
</script>

For details, see http://xregexp.com/plugins/

John Peter
2

So the answer is simply that you cannot use JavaScript's native mechanisms, or any library that relies on them, to match words the way you want to. As you already stated, \b matches word boundaries. Words must consist of word characters, and in JavaScript (and in some other regex implementations) the word characters are a-z, A-Z, 0-9 and _. Many other languages implement the \b metacharacter differently from JavaScript.
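
A small demonstration of this behaviour, as a sketch with made-up sample strings (the variable names are purely illustrative):

```javascript
// "\b" inside a string literal is a backspace (U+0008), so the backslash
// has to be doubled when the pattern is built with the RegExp constructor.
const backspaceCode = "\b".charCodeAt(0);

// ASCII word: \b finds a boundary on both sides.
const latinMatch = new RegExp("\\b" + "word" + "\\b").test("a word here");

// Greek word: neither the space nor the Greek letter is a word character,
// so \b never matches and the test fails.
const greekMatch = new RegExp("\\b" + "και" + "\\b").test("εγώ και εσύ");

// Conversely, an ASCII letter between Greek letters is seen as a whole word.
const strayMatch = /\bb\b/.test("αbγ");

console.log(backspaceCode, latinMatch, greekMatch, strayMatch);
// prints: 8 true false true
```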

The answer "JavaScript does not support Unicode" is a bit too easy and in fact wrong. JavaScript just doesn't use Unicode for its character classes. If JavaScript didn't support Unicode, you couldn't even use Unicode characters in string literals, and of course that is possible in JavaScript.

According to the ECMA 262 Standard (ECMAScript) (Section 15.10.2.6):

[...] The production Assertion :: \ b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following:

  1. Let e be x's endIndex.
  2. Call IsWordChar(e–1) and let a be the Boolean result.
  3. Call IsWordChar(e) and let b be the Boolean result.
  4. If a is true and b is false, return true.
  5. If a is false and b is true, return true.
  6. Return false. [...]

The abstract operation IsWordChar takes an integer parameter e and performs the following:

  1. If e == –1 or e == InputLength, return false.
  2. Let c be the character Input[e].
  3. If c is one of the sixty-three characters below, return true. a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _
  4. Return false

This shows that \b uses the IsWordChar algorithm to decide whether a position is a word boundary. The definition of IsWordChar lists exactly which characters make it return true.
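
The two quoted algorithms can be transcribed into plain JavaScript to make the behaviour concrete. This is only a sketch: the names `WORD_CHARS`, `isWordChar` and `isWordBoundary` are hypothetical, and the input string is passed explicitly rather than taken from the matcher's internal state as in the specification.

```javascript
// The sixty-three characters listed in step 3 of IsWordChar.
const WORD_CHARS = new Set(
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
);

// IsWordChar(e): false outside the input, otherwise membership in the list.
function isWordChar(input, e) {
  if (e === -1 || e === input.length) return false;
  return WORD_CHARS.has(input[e]);
}

// The \b assertion at position endIndex: true when exactly one of the two
// neighbouring positions holds a word character (steps 4 to 6).
function isWordBoundary(input, endIndex) {
  const a = isWordChar(input, endIndex - 1);
  const b = isWordChar(input, endIndex);
  return a !== b;
}

console.log(isWordBoundary("ab cd", 0)); // true: start of "ab"
console.log(isWordBoundary("αβ γδ", 0)); // false: α is not in the list
```

Every position in a purely Greek string yields false here, which is why \b never matches in such text.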

In my opinion, this has absolutely nothing to do with the character set being used. It is neither ASCII- nor Unicode-compliant here; it is just these 63 characters.
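
For readers on newer engines: ECMAScript 2018, which postdates this thread, added Unicode property escapes (with the u flag) and lookbehind assertions, so a Unicode-aware stand-in for \bword\b can be sketched natively. The helper names `escapeRegExp` and `unicodeWordRegExp` are hypothetical, and the choice of letters, marks, decimal digits and connector punctuation as word characters is an assumption. It approximates \b only for a literal word that itself begins and ends with word characters.

```javascript
// Escape regex syntax characters so the word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: matches `word` only where it is not adjacent to a
// Unicode letter, mark, decimal digit, or connector punctuation character.
// Requires an engine with ES2018 lookbehind and \p{...} support.
function unicodeWordRegExp(word, flags = "") {
  const W = "[\\p{L}\\p{M}\\p{Nd}\\p{Pc}]";
  const f = flags.includes("u") ? flags : flags + "u";
  return new RegExp("(?<!" + W + ")" + escapeRegExp(word) + "(?!" + W + ")", f);
}

console.log(unicodeWordRegExp("και").test("εγώ και εσύ")); // true
console.log(unicodeWordRegExp("και").test("δικαιοσύνη"));  // false
```

On engines without these features, the constructor call throws a SyntaxError, so this does not help in older browsers.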

Chris