I'm looking to detect non-English keyboard characters in a chat application.

Right now I use the following regex to identify languages such as Russian and Mandarin:

const languageRegEx = /[^\x00-\x7F]+/gi;

This has been working well. However, I have now hit an issue where emojis used in the chat are also caught by the above regex.

I have attempted to remove emojis from the input string using the following:

const ranges = [
  '[\u00A0-\u269f]',
  '[\u26A0-\u329f]',
  // The following characters could not be minified correctly
  // if specified with the ES6 syntax \u{1F400}
  '[🀄-🧀]'
  //'[\u{1F004}-\u{1F9C0}]'
];

function removeInvalidChars(text) {
  return text.replace(new RegExp(ranges.join('|'), 'ug'), '');
}

This appears to work nicely. An inbound message such as:

❤️ hey there

Results in:

" hey there"

However, when I then pass the string " hey there" into my languageRegEx, I get a false positive:

const languageRegEx = /[^\x00-\x7F]+/gi;
const badLanguageFound = languageRegEx.test(messageClean);

badLanguageFound returns true, even though I can clearly see in my debug output that the string is simply " hey there". I've also tried checking for hidden/unprintable characters, but that doesn't appear to help.

I then tried replacing each emoji with an x instead of an empty string, so that there is a character for every emoji removed. When I paste the returned values into regexr, the heart symbol still seems to be picked up. I find it strange that nothing is highlighted when I replace with '', but something is highlighted when I replace with x.

Any advice? My head is pounding trying to work this one out.

munkee

1 Answer

So it appears the problem is caused by hidden characters, as well as by the multi-character spans that make up emojis. In the end I found that someone had hit the same problem and produced a lovely little Node package to help out.
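
To make the hidden character visible, here is a minimal sketch that runs the ranges from the question over the heart emoji and prints what is left. It relies on the fact that ❤️ is two code points, U+2764 followed by the invisible variation selector U+FE0F. The helper name codePoints is made up for illustration, and the astral range is written with the \u{...} syntax here rather than literal characters.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// The heart emoji spelled out with escapes: U+2764 followed by U+FE0F.
const message = '\u2764\uFE0F hey there';

// Same ranges as in the question.
const ranges = ['[\u00A0-\u269f]', '[\u26A0-\u329f]', '[\u{1F004}-\u{1F9C0}]'];
const stripped = message.replace(new RegExp(ranges.join('|'), 'ug'), '');

// U+2764 falls inside the second range and is removed, but U+FE0F is in
// none of the ranges, so it survives at the start of the string.
console.log(codePoints(stripped).slice(0, 2));

// The leftover U+FE0F is outside \x00-\x7F, hence the false positive.
console.log(/[^\x00-\x7F]/.test(stripped));
```

The string looks like " hey there" in a debugger because U+FE0F has no visible glyph of its own.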

The resulting code became quite simple:

const emojiAware = require('emoji-aware');
const messageClean = emojiAware.withoutEmoji(messageText).filter(str => /\S/.test(str)).join('');
const languageRegEx = /[^\x00-\x7F]+/i; // eslint-disable-line
const badLanguageFound = languageRegEx.test(messageClean);

This handles such cases as:

顶级Model白可可
亲临SHOW直播间
空投超模


CEO 胡震生
直播入口:胡震生微博
直播平台:一直播
房间ID:22433681
2018 05/11  22:00

and

 sdf sdfsdf df❤️口:胡震生微sd f口:胡震生微  ds
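
For anyone who would rather avoid the extra dependency, a rough alternative sketch (not what emoji-aware does internally, and not tested against every emoji sequence) is to strip pictographic symbols together with the invisible characters that glue emoji sequences together, then run the same ASCII check. It assumes a runtime that supports Unicode property escapes with Extended_Pictographic; the names stripEmoji and hasNonAscii are made up for this example.

```javascript
// Pictographic symbols, plus characters that only appear as parts of emoji
// sequences: variation selector-16, zero-width joiner, combining keycap,
// skin tone modifiers and regional indicators (flags).
const emojiParts =
  /[\p{Extended_Pictographic}\u{FE0F}\u{200D}\u{20E3}\u{1F3FB}-\u{1F3FF}\u{1F1E6}-\u{1F1FF}]/gu;

function stripEmoji(text) {
  return text.replace(emojiParts, '');
}

function hasNonAscii(text) {
  // No g flag here: a global regex remembers lastIndex between test() calls,
  // which can make repeated checks give inconsistent results.
  return /[^\x00-\x7F]/.test(stripEmoji(text));
}

console.log(hasNonAscii('\u2764\uFE0F hey there')); // emoji only, plain ASCII text
console.log(hasNonAscii('sdf\u2764\uFE0F\u53E3 ds')); // emoji plus a CJK character
```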
munkee