
A user is flooding with characters that bypass my regex filters. When I paste those characters into a UTF-8 editor, the word looks the same as the normal one, except that the flood version cannot be selected completely: some invisible characters seem to have been inserted.


When you switch to ANSI encoding, you can clearly see the difference between the two words: liebehomo and lâ€iâ€ebâ€ehâ€oâ€mo

When I paste the spammy word into the browser developer tools as a string `s`, `s.length` gives 14, not 9!
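
The length mismatch can be inspected directly by listing the code points that fall outside printable ASCII. This is only a sketch: the sample string uses U+200B (zero-width space) as a stand-in, since the actual hidden character in the spam is not known, and `findHiddenCodePoints` is a hypothetical helper name.

```javascript
// Hypothetical sample: "liebehomo" with zero-width spaces (U+200B) inserted.
// Which invisible character the spammer really used is an assumption here.
const s = "l\u200Bi\u200Beb\u200Beh\u200Bo\u200Bmo";

// Return every code point outside printable ASCII, with its UTF-16 index.
function findHiddenCodePoints(str) {
  const found = [];
  let index = 0;
  for (const ch of str) { // for...of iterates by code point, not by UTF-16 unit
    const cp = ch.codePointAt(0);
    if (cp < 0x20 || cp > 0x7e) {
      const hex = cp.toString(16).toUpperCase().padStart(4, "0");
      found.push({ index: index, codePoint: "U+" + hex });
    }
    index += ch.length;
  }
  return found;
}

console.log(s.length); // 14: nine letters plus five hidden characters
console.log(findHiddenCodePoints(s));
```

Knowing the exact code points makes it easier to decide whether to strip a whole Unicode category or only a few specific characters.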


So my question is: how can I filter out these spammy words that contain strange characters?

yarek
  • It's been a while since I messed with this but have you tried splitting the string? And then testing each char through a loop? – EasyBB Jul 30 '20 at 22:46
  • In case anyone who runs into the same issue sees this, the following seems to work: `str.replace(/\p{C}/gu, '');`. Note that the `u` flag is required for this to work. I'm not sure if this works for all invisible characters, though (kinda hard to test) – paddotk Mar 28 '22 at 15:23

1 Answer


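Probably as simple as first stripping every character outside the printable ASCII range (space through `~`). Note that this also removes legitimate non-ASCII characters such as accented letters: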

string = string.replace(/[^ -~]+/g, "");

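// Strip everything outside printable ASCII as the user types.
// A textarea exposes its current text through .value, not .innerHTML.
document.getElementById('demo').addEventListener('input', function(e) {
    e.target.value = e.target.value.replace(/[^ -~]+/g, "");
    console.log(e.target.value);
});
<textarea id="demo"></textarea>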
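
If non-ASCII text such as accented letters has to survive, a narrower variant is to strip only Unicode format characters (general category `Cf`), which covers the common zero-width characters. This is a sketch rather than a complete defence: `stripInvisible` and the `banned` pattern are hypothetical names, and the sample string mixes several invisible characters as an assumption about what a spammer might insert.

```javascript
// Remove Unicode "format" characters (category Cf), e.g. U+200B, U+200C,
// U+200D, U+2060 and U+FEFF. The `u` flag is required for \p{...} escapes.
function stripInvisible(str) {
  return str.replace(/\p{Cf}/gu, "");
}

// Hypothetical spam sample with different zero-width characters mixed in.
const spam = "l\u200Bi\u200Deb\u200Ceh\uFEFFo\u2060mo";
const cleaned = stripInvisible(spam);

// Hypothetical word filter, applied before and after cleaning.
const banned = /liebehomo/i;

console.log(banned.test(spam));            // false: hidden chars break the match
console.log(banned.test(cleaned));         // true
console.log(cleaned.length);               // 9
console.log(stripInvisible("café naïve")); // accented letters are kept
```

Some of these characters have legitimate uses (for example, the zero-width joiner in emoji sequences and the zero-width non-joiner in Persian), so it may be safer to run the filter on a cleaned copy and leave the stored message untouched.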
dave