Dart Regex does not match whole word for Arabic text

Question

This pattern works fine in Java and JavaScript but does not seem to work in Dart. Any help is appreciated.

void main() {
  String englishText = "The new nature will not find rest";
  String englishFind = "Nature";
  RegExp englishExp = RegExp("\\b$englishFind\\b", unicode: true, caseSensitive: false);
  bool englishResult = englishExp.hasMatch(englishText); // matches
  print(englishResult); // true

  String arabicText = "لن تجد الطبيعة الجديدة راحتها";
  String arabicFind = "الطبيعة";
  RegExp arabicExp = RegExp("\\b$arabicFind\\b", unicode: true);
  bool arabicResult = arabicExp.hasMatch(arabicText); // does not match
  print(arabicResult); // false
}

`\b` only works for ASCII letters/digits. – Wiktor Stribiżew May 12 '20 at 14:28

Wiktor Stribiżew · Accepted Answer · 2020-05-12T15:16:47.247

3

The \b word boundary still matches only in ASCII contexts, even when you set unicode:true. The main point of that flag is to make sure "UTF-16 surrogate pairs in the original string will be treated as a single code point and will not match separately".

You may "decompose" the word boundary and add Arabic letter and digit ranges to the class:

String arabicText = "لن تجد الطبيعة الجديدة راحتها";
String arabicFind="الطبيعة";
RegExp arabicExp = new RegExp("(?:^|[^a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])$arabicFind(?![a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])", unicode:true);
bool arabicResult = arabicExp.hasMatch(arabicText); // matches
print(arabicResult); // => true

The regex will match the $arabicFind word when it is

(?:^|[^a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - preceded by the start of the string (^) or (|) by any char other than an ASCII letter, digit or _, or a Farsi letter or digit
(?![a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - not followed by an ASCII letter, digit or _, or a Farsi letter or digit.
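The same "decomposed boundary" idea can be wrapped in a reusable function. The sketch below is a variant, not the regex from the answer: it swaps the hand-written ranges for the Unicode property escapes \p{L}, \p{M} and \p{N}, which Dart accepts when unicode:true is set. One reason to consider this is that the ranges above follow the Farsi alphabet and leave out some Arabic-only letters, for example U+0629 (ة). The helper name wholeWord and its parameters are hypothetical, and RegExp.escape is added on the assumption that the search word may contain regex metacharacters.

```dart
/// Hypothetical helper (not part of the answer): matches `word` only when it
/// is not adjacent to another letter, combining mark, digit or underscore.
RegExp wholeWord(String word, {bool caseSensitive = true}) {
  // Assumption: "word character" means any Unicode letter, mark or number,
  // plus the underscore, mirroring what \w covers for ASCII.
  const wordChar = r'\p{L}\p{M}\p{N}_';
  // Escape the search term so characters such as . or ( are taken literally.
  final escaped = RegExp.escape(word);
  return RegExp('(?:^|[^$wordChar])$escaped(?![$wordChar])',
      unicode: true, caseSensitive: caseSensitive);
}

void main() {
  const arabicText = 'لن تجد الطبيعة الجديدة راحتها';
  print(wholeWord('الطبيعة').hasMatch(arabicText)); // true
  // A prefix of the word is rejected because the next char (ة) is a letter.
  print(wholeWord('الطبيع').hasMatch(arabicText)); // false
  print(wholeWord('Nature', caseSensitive: false)
      .hasMatch('The new nature will not find rest')); // true
}
```

As in the answer, the leading group consumes the character before the word, so this form suits hasMatch; if the exact match position matters, capture the word in a group.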

edited May 12 '20 at 15:16

answered May 12 '20 at 14:45


Thanks Wiktor, really appreciate your quick response. Will try it out. – nick May 12 '20 at 15:05
@nick I think it should be enough, in case you need to support more languages, let me know. – Wiktor Stribiżew May 12 '20 at 15:06
Thanks Wiktor! I do need to support Farsi as well. – nick May 12 '20 at 15:08
@nick I updated the solution to work with Farsi based on [this answer](https://stackoverflow.com/a/50018691/3832970). – Wiktor Stribiżew May 12 '20 at 15:17
Thank you. I marked your answer above as "accepted". Any ideas if the dart team will be fixing this? – nick May 12 '20 at 15:17
@nick I do not know, but even in JS latest ECMAScript 2018+ `\b` is still not Unicode aware (`/\bВиктор\b/.test("Виктор")` yields *`false`*). Dart team [seems to be aware](https://github.com/dart-lang/sdk/issues/28404) of the Unicode regex challenges. – Wiktor Stribiżew May 12 '20 at 15:24
Wiktor, I have a regex method that removes diacritics from Arabic. I have a list of about 31102 records with an average length of 250. In Flutter dev mode it takes about 800 milliseconds to process; in release mode it takes about 10 seconds. I would expect release to process much faster, not slower. Any ideas why? `String removeDiacritics(){ return this.replaceAll(RegExp(r'\u{0640}|\u{064D}|\u{064C}|\u{064B}|\u{064E}|\u{064F}|\u{0650}|\u{0651}|\u{0652}', unicode:true), '') .replaceAll(RegExp(r'\u{0623}|\u{0625}', unicode:true),'\u{0627}'); }` – nick May 16 '20 at 19:24
@nick Hi Nick, no idea why that happens, and I agree it is weird, unless there is something in between that hinders the code execution. Note that you should not use `(a|b|c|d)`, you should use `[abcd]`, that is, `r'[\u0640\u064D\u064C\u064B\u064E\u064F\u0650\u0651\u0652]'` and `r'[\u0623\u0625]'`. – Wiktor Stribiżew May 16 '20 at 19:48
Thanks Wiktor, I really do appreciate all your comments and input. Will correct the code as you suggested. Btw I am using Flutter 1.18.0-11.1.pre • channel beta. I also noticed that Flutter 1.17 seemed to run faster in general. – nick May 16 '20 at 20:01
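The character-class suggestion from the comments, applied to the removeDiacritics method, might look like the sketch below. The extension name ArabicNormalization and the two top-level RegExp variables are hypothetical. Hoisting the patterns so they are built once is an assumption of this sketch, not something the thread says, and it is not claimed to explain the release-mode slowdown.

```dart
// Built once and reused across calls (an assumption of this sketch).
// The code points are the ones listed in the comments above.
final RegExp _diacritics = RegExp(
    r'[\u0640\u064D\u064C\u064B\u064E\u064F\u0650\u0651\u0652]',
    unicode: true);
final RegExp _alefVariants = RegExp(r'[\u0623\u0625]', unicode: true);

// Hypothetical extension name; requires a Dart version with extension methods.
extension ArabicNormalization on String {
  /// Strips the listed marks, then maps U+0623 and U+0625 to U+0627.
  String removeDiacritics() =>
      replaceAll(_diacritics, '').replaceAll(_alefVariants, '\u0627');
}

void main() {
  // U+0623 U+064E U+0628 becomes U+0627 U+0628.
  print('\u0623\u064E\u0628'.removeDiacritics() == '\u0627\u0628'); // true
}
```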
