I'm looking to detect non-English keyboard characters in a chat application.
Right now I use the following regex to identify languages such as Russian and Mandarin:
const languageRegEx = /[^\x00-\x7F]+/gi;
This has been working well; however, I have now hit an issue where emojis used in the chat are being caught by the above regex.
I have attempted to remove emojis from the input string using the following:
const ranges = [
'[\u00A0-\u269f]',
'[\u26A0-\u329f]',
// The following characters could not be minified correctly
// if specified with the ES6 syntax \u{1F400}
'[🀄-🧀]'
//'[\u{1F004}-\u{1F9C0}]'
];
function removeInvalidChars(text) {
  return text.replace(new RegExp(ranges.join('|'), 'ug'), '');
}
This appears to work nicely. An inbound message such as:
❤️ hey there
Results in:
" hey there"
However, when I then pass the string " hey there" into my languageRegEx, I get a false positive:
const languageRegEx = /[^\x00-\x7F]+/gi;
const badLanguageFound = languageRegEx.test(messageClean);
badLanguageFound returns true, even though I can clearly see in my debug output that the string is simply " hey there". I've also tried checking for hidden or unprintable characters, but that doesn't appear to help.
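For anyone trying to reproduce this, one way to check for hidden characters is to dump the code points of the string instead of trusting what the debugger renders. This is only a minimal sketch: `listCodePoints` is a hypothetical helper name, and the sample input assumes the heart arrives as U+2764 followed by the variation selector U+FE0F, which I have not confirmed for every client.

```javascript
// Hypothetical helper: list every code point in a string as U+XXXX,
// so that invisible characters show up explicitly.
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Assumed sample: a heart (U+2764) plus a variation selector (U+FE0F).
const sample = '\u2764\uFE0F hi';
console.log(listCodePoints(sample).join(' '));
```

Running the same helper on the value returned by removeInvalidChars should show whether anything outside \x00-\x7F is still in the string.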
I then tried replacing each emoji with an x instead of an empty string, so that there is a character for every emoji removed. When I paste the returned value into regexr, the heart symbol still seems to be picked up.
I find it strange that when I replace with '' nothing is picked up, but when I replace with x it is highlighted.
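The comparison between the two replacements can be reproduced roughly as follows. This is a sketch under assumptions: it uses only the two BMP ranges from `ranges` (the astral range is left out), `stripWith` and `bmpSymbols` are hypothetical names, and the sample message assumes the heart is encoded as U+2764 followed by U+FE0F.

```javascript
// Same two BMP ranges as in `ranges` above; the astral range is left out
// because the sample below does not need it.
const bmpSymbols = /[\u00A0-\u269F]|[\u26A0-\u329F]/gu;

// Hypothetical helper: replace every match with `filler`.
function stripWith(text, filler) {
  return text.replace(bmpSymbols, filler);
}

// Assumed sample: heart (U+2764) + variation selector (U+FE0F) + text.
const message = '\u2764\uFE0F hey there';

const blanked = stripWith(message, '');
const marked = stripWith(message, 'x');

// The visible text " hey there" is 10 characters long, so any extra
// length here points at a character that is not being displayed.
console.log(blanked.length, marked.length);
```

Comparing these lengths with the length of the visible text is a quick way to tell whether the pasted value and the real value are actually the same string.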
Any advice? My head is pounding trying to work this one out.