How can I remove all graphical Unicode characters (emoji, symbols, flags, etc.) from a string?

I have tried:

$text =~ s/[\x{0001F600}-\x{0001F64F}]|[\x{0001F300}-\x{0001F5FF}]|[\x{0001F680}-\x{0001F6FF}]|[\x{0001F1E0}-\x{0001F1FF}]|[\x{2600}-\x{26FF}]//g;

It removes some characters, but not all.

These characters are left untouched by regex:

What did I miss?

Alex Storm
  • What do you want to *keep*? – ysth Sep 19 '19 at 14:44
  • Which chars fail exactly? A simple search for "regex remove emojis" presents many possible solutions so is there any reason that the existing Internet answers do not suit your needs? – MonkeyZeus Sep 19 '19 at 14:45
  • I have edited my post to clarify the question. I want to keep only Latin and Cyrillic non-graphical characters – Alex Storm Sep 19 '19 at 15:12
  • Where does your string come from? If it is defined literally in source, you may need `use utf8;`. If it comes from STDIN or a file or socket, you may need to decode it from UTF-8, or set an `:encoding(UTF-8)` layer on the handle. And then make sure it is encoded back to UTF-8 for output. – Grinnz Sep 19 '19 at 15:45
  • @Grinnz, 'use utf8' pragma is set. String comes from CGI::Fast – Alex Storm Sep 19 '19 at 16:00
  • @AlexStorm The CGI.pm and CGI::Fast modules do not decode parameters, so you need to do that. `my $param = decode 'UTF-8', scalar $cgi->param('foo');` with decode from [Encode](https://perldoc.pl/Encode). – Grinnz Sep 19 '19 at 16:18
  • See also http://blogs.perl.org/users/grinnz/2018/11/modern-perl-cgi.html – Grinnz Sep 19 '19 at 16:19
  • As an aside, `[a-c]|[e-f]` can be simplified to `[a-ce-f]` – tripleee Sep 19 '19 at 19:45
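
A minimal sketch of the decoding point raised in the comments above, assuming the parameter arrives as raw UTF-8 bytes (the sample string is made up for illustration). Until the bytes are decoded, the character class has no single code point to match. The second substitution also uses the merged character class suggested in the last comment.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "Hi " followed by U+1F600, as undecoded UTF-8 bytes
my $bytes = "Hi \xF0\x9F\x98\x80";

# On raw bytes the class sees four separate bytes, so nothing matches
(my $untouched = $bytes) =~ s/[\x{1F600}-\x{1F64F}]//g;

# Decode first; the emoji is now one code point and the class matches it
my $chars = decode('UTF-8', $bytes);
(my $cleaned = $chars) =~ s/[\x{1F300}-\x{1F5FF}\x{1F600}-\x{1F64F}\x{1F680}-\x{1F6FF}\x{1F1E0}-\x{1F1FF}\x{2600}-\x{26FF}]//g;

# Encode back to UTF-8 bytes for output
print encode('UTF-8', $cleaned), "\n";
```

With CGI::Fast the same idea would apply to each parameter value before any substitution is run on it.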

1 Answer

I am posting the solution to my own question. This is a Perl version of the answer given here

Created a whitelist for:

  • all numeric (\p{N})
  • letter (\p{L})
  • mark (\p{M})
  • punctuation (\p{P})
  • whitespace/separator (\p{Z})
  • other formatting (\p{Cf}), surrogate (\p{Cs}), and newline (\s) characters. Note that in a decoded Perl string, characters above U+FFFF are ordinary single code points and are not matched by \p{Cs}.
  • \p{L} specifically includes the characters from other alphabets such as Cyrillic, Latin, Kanji, etc.

$text =~ s/[^\p{L}\p{M}\p{N}\p{P}\p{Z}\p{Cf}\p{Cs}\s]//g;

Or blacklist the symbol characters in the "Other Symbol" category (\p{So}):

$text =~ s/\p{So}//g;
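
To compare the two approaches, here is a small sketch on a made-up sample string (the string and variable names are only for illustration). The class here includes \s so that the newline survives, matching the whitelist description. The difference shows up on characters such as "+", which is a math symbol (\p{Sm}): the whitelist drops it, while the \p{So} blacklist keeps it.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical sample: Latin, Cyrillic, "+", U+1F600, U+2600 and a newline
my $text = "Hi \x{43C}\x{438}\x{440}, a+b \x{1F600}\x{2600}\n";

# Whitelist: remove everything that is not in one of the allowed classes
(my $white = $text) =~ s/[^\p{L}\p{M}\p{N}\p{P}\p{Z}\p{Cf}\p{Cs}\s]//g;

# Blacklist: remove only characters in the Other Symbol category
(my $black = $text) =~ s/\p{So}//g;

print $white;   # both emoji and the "+" are gone
print $black;   # both emoji are gone, the "+" stays
```

Which one fits depends on whether currency and math symbols should be kept.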
Alex Storm