I have a JavaScript function which attempts to identify the language of a piece of text and see if it matches a specific language.
For example, I pass the function the language "Italian" and it checks whether the text contains a certain number of very common Italian words.
Part of the code looks like this and it works absolutely fine:
switch ( defLanguage ) {
    case "Italian":
        var foreign_count = str.match(/\b(non|di|che|è|e|la|il|un|a|per|in|una|mi|sono|ho|ma|l'|lo|ha|le)\b/g).length;
        break;
    case "German":
        var foreign_count = str.match(/\b(das|ist|Sie|ich|nicht|die|es|und|der|was|ein|zu|er|in|sie|mir|mit|den|auf|mich)\b/g).length;
        break;
}
This sets foreign_count, which tells me how many "foreign" words are in the text.
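For reference, here is a minimal sketch of the same counting step with a guard for the case where nothing matches. With the g flag, String.prototype.match returns null rather than an empty array when there are no hits, so reading .length directly throws in that case. The helper name countMatches and the sample sentences are made up for illustration and are not part of the original function.

```javascript
// Hypothetical helper (not in the original code): counts hits and treats
// "no match" as zero instead of reading .length on null.
function countMatches(str, pattern) {
  // With the g flag, match() returns an array of all hits, or null if there are none.
  var hits = str.match(pattern);
  return hits === null ? 0 : hits.length;
}

// Same German word list as above.
var germanWords = /\b(das|ist|Sie|ich|nicht|die|es|und|der|was|ein|zu|er|in|sie|mir|mit|den|auf|mich)\b/g;

// "ist" and "nicht" are whole-word hits; "Das" differs in case, and "mein"
// and "Hund" only contain listed words as substrings.
console.log(countMatches("Das ist nicht mein Hund", germanWords)); // 2

// None of the listed words appear as whole words, so match() returns null.
console.log(countMatches("hello world", germanWords)); // 0
```

Whether this is what happens with the French list depends on the input text, which is not shown here.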
So far, so good. But with French there's a problem.
If I put the \b word boundary around the possible words, it doesn't work (i.e. the JavaScript stops executing from that point on).
var foreign_count = str.match(/\b(le|de|un|à|avec|et|en|je|que|pour|dans|ce|il|qui|ne|sur|se|pas|plus|par)\b/g).length;
However, if I remove the \b then it does work!
var foreign_count = str.match(/(le|de|un|à|avec|et|en|je|que|pour|dans|ce|il|qui|ne|sur|se|pas|plus|par)/g).length;
This is driving me up the wall. The \b works fine with the German and Italian (and other language) examples, but doesn't work with French. I can't for the life of me work out why, and I obviously need the word boundaries in there, so I need to sort this out.
Any help would be much appreciated!
====== further information ========
The problem, it seems, is not related to non-ASCII characters.
This does not work:
str.match(/\b(jag|det|du|inte|att|en|och|har|vi|i|han|vad|som)\b/g).length;
But this does:
str.match(/\b(jag|det|du|inte|att|en|och|har|vi|i|han|vad|om)\b/g).length;
It seems that certain words (all in ASCII characters) cause an error when combined with the \b marker. I can't use (?<=\s|^) because, by all accounts, lookbehind is not supported in JavaScript.
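As a possible stand-in for the lookbehind, here is a rough sketch that uses a non-capturing leading group plus a lookahead instead of \b. It assumes words are separated by whitespace only (so "le," with a trailing comma would not count) and that the word list contains no regex metacharacters. The helper name countWholeWords and the sample sentences are invented for illustration.

```javascript
// Hypothetical helper: counts listed words delimited by whitespace or the
// string edges, without using \b or lookbehind.
// Assumes the words contain no regex metacharacters.
function countWholeWords(str, words) {
  // (?:^|\s) stands in for the lookbehind and consumes the leading space;
  // (?=\s|$) is a lookahead, so the trailing space stays available for the next match.
  var pattern = new RegExp("(?:^|\\s)(?:" + words.join("|") + ")(?=\\s|$)", "g");
  var hits = str.match(pattern);
  // match() returns null when nothing matches, so guard before reading .length.
  return hits === null ? 0 : hits.length;
}

var swedish = ["jag", "det", "du", "inte", "att", "en", "och", "har", "vi", "i", "han", "vad", "som"];

// Hits: jag, har, en, och, du, har, inte ("bil" is not in the list).
console.log(countWholeWords("jag har en bil och du har inte", swedish)); // 7

// Non-ASCII words are delimited the same way, since \b is not involved.
console.log(countWholeWords("il va à Paris", ["à", "il"])); // 2
```

Because the leading whitespace is part of each match, the entries of the matched array include that space; only the count is used here.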