I have a JavaScript function which attempts to identify the language of a piece of text and check whether it matches a specific language.

For example, I send the function the language of "Italian" and it attempts to see if the text contains a certain number of very common Italian words.

Part of the code looks like this and it works absolutely fine:

switch ( defLanguage ) {
    case "Italian":
        var foreign_count = str.match(/\b(non|di|che|è|e|la|il|un|a|per|in|una|mi|sono|ho|ma|l'|lo|ha|le)\b/g).length;
        break;
    case "German":
        var foreign_count = str.match(/\b(das|ist|Sie|ich|nicht|die|es|und|der|was|ein|zu|er|in|sie|mir|mit|den|auf|mich)\b/g).length;
        break;
}

This returns foreign_count which tells me how many "foreign" words are in the text.

So far, so good. But with French there's a problem.

If I put the \b word boundary around the possible words, it doesn't work (i.e. the JavaScript stops executing from that point on).

var foreign_count = str.match(/\b(le|de|un|à|avec|et|en|je|que|pour|dans|ce|il|qui|ne|sur|se|pas|plus|par)\b/g).length;

However, if I remove the \b then it does work!

var foreign_count = str.match(/(le|de|un|à|avec|et|en|je|que|pour|dans|ce|il|qui|ne|sur|se|pas|plus|par)/g).length;

This is driving me up the wall. The \b works fine with the German & Italian (and other language) examples, but doesn't work with French. I can't for the life of me work out why and obviously I need the word boundaries in there so I need to sort this out.

Any help would be much appreciated!

====== further information ========

The problem, it seems, is not related to non-ASCII characters.

This does not work:

str.match(/\b(jag|det|du|inte|att|en|och|har|vi|i|han|vad|som)\b/g).length;

But this does:

str.match(/\b(jag|det|du|inte|att|en|och|har|vi|i|han|vad|om)\b/g).length;

It seems that certain words (all in ASCII characters) cause an error in combination with the \b marker. I can't use (?<=\s|^) because, by all accounts, lookbehind is not supported in JavaScript.

arathra
  • On a further tack, Greek doesn't work either with or without the \b – arathra Mar 02 '17 at 18:42
  • What likely happens is your regex fails when encountering non-ASCII characters. See [there](http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters), maybe it'll help. – M. Prokhorov Mar 02 '17 at 18:52
  • Oh, and also [this answer](http://stackoverflow.com/a/280762/7470253) – M. Prokhorov Mar 02 '17 at 19:01
  • This is getting weird. If the regex is **str.match(/\b(in)\b/g).length** then it DOES work but if it's **str.match(/\b(dans)\b/g).length** then it doesn't! Is there something wrong with the **dans** characters? I tried on longer lists of words and it always falls down on these. – arathra Mar 02 '17 at 19:35
  • Use whitespace boundaries instead: `(?<!\S)(?:a|b|c)(?!\S)`, problem solved! – Mar 02 '17 at 20:16

1 Answer

It's because of how \b is defined:

Matches a word boundary. This is the position where a word character is not followed or preceded by another word-character, such as between a letter and a space. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero.

... and how word character (aka \w) is defined:

Matches any alphanumeric character from the basic Latin alphabet, including the underscore. Equivalent to [A-Za-z0-9_].
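
A quick way to see both definitions at work is to test them directly. This is only a small sketch: the sample sentences and variable names are made up for illustration, and "à" is written as the escape \u00e0 so the result does not depend on how the file is encoded.

```javascript
// \w covers only [A-Za-z0-9_], so an accented letter counts as a non-word character.
const aGrave = "\u00e0"; // "à" as a single precomposed code point
const isWordChar = /\w/.test(aGrave); // false

// With a space on each side of "à" there is no word/non-word transition
// next to it, so neither \b can match and the pattern finds nothing.
const standalone = new RegExp("\\b(" + aGrave + ")\\b", "g");
const standaloneMatches = ("il va " + aGrave + " Paris").match(standalone); // null

// Purely ASCII alternatives are unaffected by this.
const asciiMatches = "il va dans le parc".match(/\b(le|dans)\b/g); // ["dans", "le"]
```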

Clearly à is not a word character so it cannot help match a word boundary.
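
One more detail may explain why the script stops even for all-ASCII lists such as the Swedish one: with the g flag, str.match() returns null (not an empty array) when nothing matches, so calling .length on the result throws a TypeError. Without \b, short alternatives like "le" or "en" match inside other words, which would hide that case. The sketch below is one possible way to handle both issues, not the only one. The helper name countCommonWords is hypothetical, and it assumes an engine with ES2018 support for lookbehind and Unicode property escapes, which was not widely available when the question was asked.

```javascript
// Hypothetical helper (not part of the original question's code).
// Counts how many times any of the given words occurs as a whole word in str.
function countCommonWords(str, words) {
  // Escape regex metacharacters so each word is matched literally.
  const escaped = words.map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));

  // Assumption: ES2018 lookbehind and \p{...} are available.
  // The two lookarounds act as a Unicode-aware replacement for \b:
  // no letter, digit or underscore directly before or after the word.
  const pattern = new RegExp(
    "(?<![\\p{L}\\p{N}_])(?:" + escaped.join("|") + ")(?![\\p{L}\\p{N}_])",
    "gu"
  );

  // match() returns null when there is no match, so guard before reading .length.
  const matches = str.match(pattern);
  return matches === null ? 0 : matches.length;
}

const frenchWords = ["le", "\u00e0", "avec", "je"]; // "\u00e0" is "à"
const frenchCount = countCommonWords("je vais \u00e0 Paris avec le chien", frenchWords); // 4
const noMatchCount = countCommonWords("hello world", ["dans"]); // 0, no TypeError
```

On older engines without lookbehind, the null guard alone (for example `(str.match(re) || []).length`) still avoids the exception, although \b will continue to miss words that begin or end with an accented letter.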

Álvaro González