Text Matching not working for Arabic issue may be due to regex for arabic

Question

I have been working on adding a feature to my multilingual website that highlights matching tag keywords.

This works for the English version but does not fire for the Arabic version.

I have set up a sample on JSFiddle.

Sample Code

    function HighlightKeywords(keywords)
    {        
        var el = $("#article-detail-desc");
        var language = "ar-AE";
        var pid = 32;
        var issueID = 18; 
        $(keywords).each(function()
        {
           // var pattern = new RegExp("("+this+")", ["gi"]); //breaks html
            var pattern = new RegExp("(\\b"+this+"\\b)(?![^<]*?>)", "gi"); //looks for match outside html tags
            var rs = "<a class='ad-keyword-selected' href='http://www.alshindagah.com/ar/search.aspx?Language="+language+"&PageId="+pid+"&issue="+issueID+"&search=$1' title='Search website for:  $1'><span style='color:#990044; text-decoration:none;'>$1</span></a>";
            el.html(el.html().replace(pattern, rs));
        });
    }   

    HighlightKeywords(["you","الهدف","طهران","سيما","حاليا","Hello","34","english"]);

    // Popup tooltip for article keywords
    $(function() {
        $("#article-detail-desc").tooltip({
            position: {
                my: "center bottom-20",
                at: "center top",
                using: function( position, feedback ) {
                    $( this ).css( position );
                    $( "<div>" )
                        .addClass( "arrow" )
                        .addClass( feedback.vertical )
                        .addClass( feedback.horizontal )
                        .appendTo( this );
                }
            }
        });
    });

I store the keywords in an array and then match them against the text in a particular div.

I am not sure whether the problem is due to Unicode or something else. Any help is appreciated.

It works for me with the first pattern, the one you commented out: `var pattern = new RegExp("("+this+")", ["gi"]);`. Check this [FIDDLE](http://jsfiddle.net/mojtaba/Dgysc/41/). — , May 21 '13 at 15:50

score 8 · Accepted Answer · edited Jun 20 '20 at 09:12

There are three sections to this answer

Why it's not working
An example of how you could approach it in English (meant to be adapted to Arabic by someone with a clue about Arabic)
A stab at doing the Arabic version by someone (me) who hasn't a clue about Arabic :-)

Why it's not working

At least part of the problem is that you're relying on the \b assertion, which (like its counterparts \B, \w, and \W) is English-centric. You can't rely on it in other languages (or even, really, in English — see below).

Here's the definition of \b in the spec:

The production Assertion :: \ b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following:

Let e be x's endIndex.

Call IsWordChar(e–1) and let a be the Boolean result.

Call IsWordChar(e) and let b be the Boolean result.

If a is true and b is false, return true.

If a is false and b is true, return true.

Return false.

...where IsWordChar is defined further down as basically meaning one of these 63 characters:

a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
0  1  2  3  4  5  6  7  8  9  _

I.e., the 26 English letters a to z in upper or lower case, the digits 0 to 9, and _. (This means you can't even rely on \b, \B, \w, or \W in English, because English has loan words like "Voilà", but that's another story.)
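
To make that concrete, here is a small sketch of how `\b` behaves around English, Arabic, and accented text. The helper name `hasWholeWord` and the sample sentences are made up for illustration; the pattern it builds is the same `\b` + keyword + `\b` shape the question uses.

```javascript
// Build the same kind of pattern as the question: \b + keyword + \b.
// (hasWholeWord is a hypothetical helper, not part of the original code.)
function hasWholeWord(text, word) {
    return new RegExp("\\b" + word + "\\b").test(text);
}

// "H" and "o" are word characters and the spaces around them are not,
// so \b holds on both sides of "Hello".
var englishHit = hasWholeWord("Say Hello now", "Hello");

// Neither the spaces nor the Arabic letters count as word characters,
// so \b never holds at the edges of the Arabic word and nothing matches.
var arabicHit = hasWholeWord("في طهران اليوم", "طهران");

// "à" is not a word character either, so \b holds between "l" and "à"
// and the fragment "Voil" is treated as a whole word.
var loanWordHit = hasWholeWord("Voilà", "Voil");
```

With these inputs, `englishHit` and `loanWordHit` are `true` and `arabicHit` is `false`, which is the symptom described in the question.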

A first example using English

You'll have to use a different mechanism for detecting word boundaries in Arabic. If you can come up with a character class that includes all of the Arabic "code points" (as Unicode puts it) that make up words, you could use code a bit like this:

var keywords = {
    "laboris": true,
    "laborum": true,
    "pariatur": true
    // ...and so on...
};
var text = /*... get the text to work on... */;
text = text.replace(
    /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
    replacer);

function replacer(m, c0, c1) {
    if (keywords[c0]) {
        c0 = '<a href="#">' + c0 + '</a>';
    }
    // c1 is undefined when nothing follows the last word
    return c0 + (c1 || "");
}

Notes on that:

I've used the class [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "a word character". Obviously you'd have to change this (markedly) for Arabic.
I've used the class [^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "not a word character". This is the same as the previous class with the negation (^) at the outset.
The regular expression finds any series of "word characters" followed by an optional series of non-word characters, using capture groups ((...)) for both.
String#replace calls the replacer function with the full text that matched followed by each capture group as arguments.
The replacer function looks up the first capture group (the word) in the keywords map to see if it's a keyword. If so, it wraps it in an anchor.
The replacer function returns that possibly-wrapped word plus the non-word text that followed it.
String#replace uses the return value from replacer to replace the matched text.
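
The notes above can be traced with a short sketch that records what `String#replace` hands to the callback on each call. The sample string, the `calls` array, and the upper-casing are only there for illustration; the regular expression is the same one, written with ranges instead of spelling out every character.

```javascript
// Same word / non-word classes as above, written as ranges.
var wordThenGap = /([A-Za-z0-9_]+)([^A-Za-z0-9_]+)?/g;

// Record the arguments String#replace passes on each call.
var calls = [];
var output = "sit amet, elit".replace(wordThenGap, function (m, c0, c1) {
    calls.push({ match: m, word: c0, gap: c1 });
    // Upper-case the word so the effect of the return value is visible.
    // The second group is optional, so guard against undefined.
    return c0.toUpperCase() + (c1 || "");
});

// calls[0] is { match: "sit ",   word: "sit",  gap: " " }
// calls[1] is { match: "amet, ", word: "amet", gap: ", " }
// calls[2] is { match: "elit",   word: "elit", gap: undefined }
// output is "SIT AMET, ELIT"
```

Note the last call: because the second capture group is optional, its argument is `undefined` when a word ends the string, so a replacer that concatenates it needs a fallback such as `(c1 || "")`.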

Here's a full example of doing that:

<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8 />
<title>Replacing Keywords</title>
</head>
<body>
  <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
  
  <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
  <script>
    (function() {
      // Our keywords. There are lots of ways you can produce
      // this map, here I've just done it literally
      var keywords = {
        "laboris": true,
        "laborum": true,
        "pariatur": true
      };
      
      // Loop through all our paragraphs (okay, so we only have one)
      $("p").each(function() {
        var $this, text;
        
        // We'll use jQuery on `this` more than once,
        // so grab the wrapper
        $this = $(this);
        
        // Get the text of the paragraph
        // Note that this strips off HTML tags, a
        // real-world solution might need to loop
        // through the text nodes rather than act
        // on the full text all at once
        text = $this.text();

        // Do the replacements
        // These character classes match JavaScript's
        // definition of a "word" character and so are
        // English-centric, obviously you'd change that
        text = text.replace(
          /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
          replacer);
        
        // Update the paragraph
        $this.html(text);
      });

      // Our replacer. We define it separately rather than
      // inline because we use it more than once      
      function replacer(m, c0, c1) {
        // Is the word in our keywords map?
        if (keywords[c0]) {
          // Yes, wrap it
          c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when nothing follows the last word
        return c0 + (c1 || "");
      }
    })();
  </script>
</body>
</html>
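
The comment in the example notes that `.text()` strips HTML tags and that a real-world version might need to walk the text nodes instead. Here is a rough sketch of that idea, under some assumptions: `splitByKeywords` and `highlightTextNodes` are hypothetical names, the `keywords` map has the same shape as above, and the word class is still the English-centric one. It uses only standard DOM calls (`createTreeWalker`, `createDocumentFragment`, `replaceChild`) and does not need jQuery.

```javascript
// Split a string into runs of word characters and runs of everything else,
// flagging the word runs that appear in the keywords map.
function splitByKeywords(text, keywords) {
    var parts = [];
    var re = /([A-Za-z0-9_]+)|([^A-Za-z0-9_]+)/g;
    var m;
    while ((m = re.exec(text)) !== null) {
        var isWord = m[1] !== undefined;
        parts.push({
            text: m[0],
            // hasOwnProperty avoids false hits on words like "constructor"
            isKeyword: isWord &&
                Object.prototype.hasOwnProperty.call(keywords, m[1])
        });
    }
    return parts;
}

// Replace each text node under `root` with a fragment in which the
// keywords are wrapped in anchors. Existing elements are left untouched.
function highlightTextNodes(root, keywords) {
    var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    var nodes = [];
    // Collect first, then modify, so the walker is not disturbed.
    while (walker.nextNode()) {
        nodes.push(walker.currentNode);
    }
    nodes.forEach(function (node) {
        var fragment = document.createDocumentFragment();
        splitByKeywords(node.nodeValue, keywords).forEach(function (part) {
            if (part.isKeyword) {
                var a = document.createElement("a");
                a.href = "#";
                a.textContent = part.text;
                fragment.appendChild(a);
            } else {
                fragment.appendChild(document.createTextNode(part.text));
            }
        });
        node.parentNode.replaceChild(fragment, node);
    });
}
```

In a page you would call something like `highlightTextNodes(document.getElementById("article-detail-desc"), keywords)`. Because the keyword text is written through `textContent` rather than concatenated into an HTML string, nothing in the article text gets reinterpreted as markup.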

A stab at doing it with Arabic

I took a stab at the Arabic version. According to the "Arabic script in Unicode" page on Wikipedia, there are several code ranges used, but all of the text in your example fell into the primary range of U+0600 to U+06FF.

Here's what I came up with: Fiddle (I prefer JSBin, what I used above, but I couldn't get the text to come out the right way around.)

(function() {
    // Our keywords. There are lots of ways you can produce
    // this map, here I've just done it literally
    var keywords = {
        "الهدف": true,
        "طهران": true,
        "سيما": true,
        "حاليا": true
    };
    
    // Loop through all our paragraphs (okay, so we only have two)
    $("p").each(function() {
        var $this, text;
        
        // We'll use jQuery on `this` more than once,
        // so grab the wrapper
        $this = $(this);
        
        // Get the text of the paragraph
        // Note that this strips off HTML tags, a
        // real-world solution might need to loop
        // through the text nodes rather than act
        // on the full text all at once
        text = $this.text();
        
        // Do the replacements
        // These character classes just use the primary
        // Arabic range of U+0600 to U+06FF, you may
        // need to add others.
        text = text.replace(
            /([\u0600-\u06ff]+)([^\u0600-\u06ff]+)?/g,
            replacer);
        
        // Update the paragraph
        $this.html(text);
    });
    
    // Our replacer. We define it separately rather than
    // inline because we use it more than once      
    function replacer(m, c0, c1) {
        // Is the word in our keywords map?
        if (keywords[c0]) {
            // Yes, wrap it
            c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when nothing follows the last word
        return c0 + (c1 || "");
    }
})();

All I did to my English function above was:

Use [\u0600-\u06ff] to be "a word character" and [^\u0600-\u06ff] to be "not a word character". You may need to add some of the other ranges listed on that Wikipedia page (such as the appropriate style of numerals), but again, all of the text in your example fell into that range.
Change the keywords to be four of yours from your example (only two of which seem to be in the text).

To my very non-Arabic-reading eyes, it seems to work.
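
If you can target engines with ES2018 regular expressions, a different technique from the range-based one above is to search for the keywords directly and define the word boundary yourself with lookbehind and lookahead on Unicode property escapes. This is only a sketch under that assumption: `escapeRegExp`, `buildKeywordPattern`, and `linkKeywords` are names I made up, and treating any letter, combining mark, or decimal digit as a "word character" is my guess at a reasonable definition, not something checked by an Arabic reader.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one pattern that matches any keyword, but only where it is not
// glued to another letter, mark, or digit on either side.
function buildKeywordPattern(keywords) {
    var alternatives = keywords.slice()
        // Longest first, so a multi-word keyword wins over a shorter
        // keyword that is its prefix.
        .sort(function (a, b) { return b.length - a.length; })
        .map(escapeRegExp)
        .join("|");
    // Assumed definition of a word character: any Unicode letter,
    // combining mark, decimal digit, or underscore.
    var wordChar = "[\\p{L}\\p{M}\\p{Nd}_]";
    return new RegExp(
        "(?<!" + wordChar + ")(?:" + alternatives + ")(?!" + wordChar + ")",
        "gu");
}

// Wrap every whole-word keyword occurrence in an anchor.
function linkKeywords(text, keywords) {
    return text.replace(buildKeywordPattern(keywords), '<a href="#">$&</a>');
}

// The Arabic comma after the keyword is punctuation, so it does not
// block the match.
var linked = linkKeywords("وصل الوفد إلى طهران، اليوم", ["طهران", "الهدف"]);

// The keyword is part of a longer word here, so nothing is wrapped.
var untouched = linkKeywords("لاسيما", ["سيما"]);

// A keyword containing a space is matched as one unit.
var phrase = linkKeywords("هذا هو الهدف الرئيسي", ["الهدف", "الهدف الرئيسي"]);
```

Because the keywords themselves drive the match, a keyword that contains a space is handled the same way as a single word, and punctuation such as the Arabic comma is not counted as part of a word the way it is with the plain U+0600 to U+06FF range.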

Can you point me to any such code snippet which can resolve this issue. — Learning, May 21 '13 at 07:41
@KnowledgeSeeker: I don't know enough about Arabic, I'm afraid. You'll probably want to use [alternation](http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.4) and [your own character class(es)](http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.13). For instance, `\^|[a-z]` means "beginning of string or `a` to `z`". So if you can do something similar for Arabic... — T.J. Crowder, May 21 '13 at 07:46
Sorry, that should have been `/^|[a-z]/` (the `//` are to denote a regular expression literal, `\^` would make `^` literal, which wasn't what I meant at all). — T.J. Crowder, May 21 '13 at 08:12
@KnowledgeSeeker: I've added an example of how you can do this if you can come up with a character class meaning "an Arabic word character." — T.J. Crowder, May 21 '13 at 08:39
@TJ, your solution works for English, or for languages where the characters in a character set do not change when placed next to each other. In English you would write `[abcd]`, but when Arabic characters are put together they join into words: see `[ابتث]`. When I add spaces it looks fine but doesn't work: `[اب ت ث]`. I appreciate your example, which is very clean, but unfortunately it is not working for me. Here `a b c d = اب ت ث` in Arabic. — Learning, May 21 '13 at 13:23
@KnowledgeSeeker: I'm not quite sure I follow you. Putting characters together in English makes words too, e.g. `[code]`. But when you use `[...]` in a regular expression, the engine looks at the individual characters. I realize Arabic script is a "combining" script but I'm not sure that makes much of a difference for this specific application. I've taken a stab at the Arabic version, see the update. — T.J. Crowder, May 21 '13 at 15:26
+1 for your solution and comments. I am a bit confused by the Arabic version myself. I agree with you for English: for example, if I write `[c o d e]` with spaces and then without, it comes out as `[code]`. Now see what happens in Arabic (I used Google Translate for this): `[ر م ز]`, and when I delete the spaces the result is `[رمز]`. Another example: `[working]` and `[w o r k i n g]`; the Arabic version is `[العمل]`, and with spaces `[ال ع م ل]`. I had the same issue when I was doing it as in your first example, but your second example is the more appropriate solution: `/([\u0600-\u06ff]+)([^\u0600-\u06ff]+)?/` — Learning, May 22 '13 at 04:38
@KnowledgeSeeker: There's a big difference between `[code]` and `[c o d e]`: The second one includes spaces in the character class, and so will match spaces. You don't want that. `[\u0600-\u06ff]` is identical to listing each of those 256 characters individually in the square brackets. — T.J. Crowder, May 22 '13 at 07:05
@TJ: I added the spaces only as an example, to show how Arabic characters join into words when the spaces are deleted. — Learning, May 22 '13 at 07:29
@KnowledgeSeeker: See my note above: Not within `[]` they don't! Just as with English. I recommend reading up on how the regular expression syntax works in JavaScript. I've given various links above. — T.J. Crowder, May 22 '13 at 07:34
@TJ, it works well in Arabic only for single words; keywords containing spaces break. I would appreciate help in making the keywords match the text in this fiddle example: http://jsfiddle.net/u3k01bfw/1/. I have tried for some time but have not got it working. — Learning, Apr 09 '15 at 13:31
