
This pattern works in Java and JavaScript but does not seem to work in Dart. Any help is appreciated.

void main() {
  String englishText = "The new nature will not find rest";
  String englishFind = "Nature";
  RegExp englishExp = RegExp("\\b$englishFind\\b", unicode: true, caseSensitive: false);
  bool englishResult = englishExp.hasMatch(englishText); // matches
  print(englishResult); // true

  String arabicText = "لن تجد الطبيعة الجديدة راحتها";
  String arabicFind = "الطبيعة";
  RegExp arabicExp = RegExp("\\b$arabicFind\\b", unicode: true);
  bool arabicResult = arabicExp.hasMatch(arabicText); // does not match
  print(arabicResult); // false
}
nick

1 Answer


The \b word boundary still matches only in ASCII contexts, even when you set unicode:true. The main point of that flag is to make sure "UTF-16 surrogate pairs in the original string will be treated as a single code point and will not match separately".

You may "decompose" the word boundary and add Arabic letter and digit ranges to the class:

String arabicText = "لن تجد الطبيعة الجديدة راحتها";
String arabicFind="الطبيعة";
RegExp arabicExp = new RegExp("(?:^|[^a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])$arabicFind(?![a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])", unicode:true);
bool arabicResult = arabicExp.hasMatch(arabicText); // matches
print(arabicResult); // => true

The regex matches the $arabicFind word when it is:

  • (?:^|[^a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - preceded by the start of the string (^) or (|) by any character other than an ASCII letter, digit, or _, or one of the listed Arabic/Farsi letters, digits, or diacritics
  • (?![a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - not followed by an ASCII letter, digit, or _, or by one of the listed Arabic/Farsi letters, digits, or diacritics.
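
As a possible alternative to listing the ranges by hand, the same "decomposed boundary" idea can be sketched with Unicode property escapes. This assumes the runtime accepts \p{...} escapes and lookbehind when unicode: true is set (on the web this depends on the browser's JavaScript engine). The helper name wholeWord and the choice of \p{L}, \p{M}, \p{N} plus _ as "word characters" are illustrative assumptions, not part of the answer above. Unlike the (?:^|[^...]) group, the lookbehind does not consume the preceding character.

```dart
/// Hypothetical helper (not from the answer above): builds a regex that
/// matches [word] only when it is not adjacent to another "word character".
/// Assumption: a word character is any Unicode letter, combining mark,
/// number, or underscore.
RegExp wholeWord(String word, {bool caseSensitive = true}) {
  // Character class standing in for a Unicode-aware \w.
  const wordChar = r'[\p{L}\p{M}\p{N}_]';
  return RegExp(
    // Negative lookbehind and lookahead replace the two \b anchors.
    // RegExp.escape makes sure the search term is treated literally.
    '(?<!$wordChar)${RegExp.escape(word)}(?!$wordChar)',
    unicode: true, // required for \p{...} to be recognized
    caseSensitive: caseSensitive,
  );
}
```

With the strings from the question, wholeWord(arabicFind).hasMatch(arabicText) should report a match, while a fragment such as "طبيعة" that only occurs inside a longer word should not.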
Wiktor Stribiżew
  • Thanks Wiktor, really appreciate your quick response. Will try it out. – nick May 12 '20 at 15:05
  • @nick I think it should be enough, in case you need to support more languages, let me know. – Wiktor Stribiżew May 12 '20 at 15:06
  • Thanks Wiktor! I do need to support Farsi as well. – nick May 12 '20 at 15:08
  • @nick I updated the solution to work with Farsi based on [this answer](https://stackoverflow.com/a/50018691/3832970). – Wiktor Stribiżew May 12 '20 at 15:17
  • Thank you. I marked your answer above as "accepted". Any idea whether the Dart team will be fixing this? – nick May 12 '20 at 15:17
  • @nick I do not know, but even in JS latest ECMAScript 2018+ `\b` is still not Unicode aware (`/\bВиктор\b/.test("Виктор")` yields *`false`*). Dart team [seems to be aware](https://github.com/dart-lang/sdk/issues/28404) of the Unicode regex challenges. – Wiktor Stribiżew May 12 '20 at 15:24
  • Wiktor, I have a regex method that removes diacritics from Arabic. I have a list of about 31102 records with an average length of 250. In Flutter dev mode it takes about 800 milliseconds to process; in release mode it takes about 10 seconds. I would expect release to process much faster, not slower. Any ideas why? String removeDiacritics(){ return this.replaceAll(RegExp(r'\u{0640}|\u{064D}|\u{064C}|\u{064B}|\u{064E}|\u{064F}|\u{0650}|\u{0651}|\u{0652}', unicode:true), '') .replaceAll(RegExp(r'\u{0623}|\u{0625}', unicode:true),'\u{0627}'); } – nick May 16 '20 at 19:24
  • @nick Hi Nick, no idea why that happens, and I agree it is weird, unless there is something in between that hinders the code execution. Note that you should not use `(a|b|c|d)`, you should use `[abcd]`, that is, `r'[\u0640\u064D\u064C\u064B\u064E\u064F\u0650\u0651\u0652]'` and `r'[\u0623\u0625]'`. – Wiktor Stribiżew May 16 '20 at 19:48
  • Thanks Wiktor, I really do appreciate all your comments and input. Will correct the code as you suggested. Btw I am using Flutter 1.18.0-11.1.pre • channel beta. I also noticed that Flutter 1.17 seemed to run faster in general. – nick May 16 '20 at 20:01
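
The character-class suggestion from the comments could look roughly like the sketch below. It uses the two classes quoted in the comment; the extension name ArabicNormalization and the top-level variables are illustrative assumptions. Building each RegExp once instead of on every call is an extra choice made here, and nothing in the thread confirms that it explains the release-mode slowdown.

```dart
// Compiled once at first use rather than inside every call.
// Tatweel (U+0640) plus the Arabic diacritics listed in the comment.
final RegExp _diacritics =
    RegExp(r'[\u0640\u064D\u064C\u064B\u064E\u064F\u0650\u0651\u0652]');

// Alef with hamza above (U+0623) and below (U+0625).
final RegExp _alefVariants = RegExp(r'[\u0623\u0625]');

/// Hypothetical extension name; the method name follows the comment thread.
extension ArabicNormalization on String {
  /// Strips the listed diacritics, then folds alef-with-hamza into
  /// plain alef (U+0627).
  String removeDiacritics() =>
      replaceAll(_diacritics, '').replaceAll(_alefVariants, '\u0627');
}
```

A character class lets the engine test one character against a set instead of trying each alternative in turn, which is the reason given in the comment for preferring [abcd] over (a|b|c|d).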