#!/usr/bin/perl -T
use strict;
use warnings;
use utf8;
my $s = shift || die;
$s =~ s/[^A-Za-z ]//g;
print "$s\n";
exit;

> ./poc.pl "El Guapö"
El Guap

Is there a way to modify this Perl code so that various umlauts and character accents are not stripped out? Thanks!

Timothy B.

1 Answer


For the direct question, you may simply need the \p{L} (Letter) Unicode character property.

However, more importantly, decode all input and encode output.

use warnings;
use strict;
use feature 'say';

use utf8;   # allow non-ascii (UTF-8) characters in the source

use open ':std', ':encoding(UTF-8)';  # for standard streams

use Encode qw(decode_utf8);           # @ARGV escapes the above

my $string = 'El Guapö';
if (@ARGV) {
    $string = join ' ', map { decode_utf8($_) } @ARGV;
}
say "Input:     $string";

$string =~ s/[^\p{L} ]//g;

say "Processed: $string";

When run as script.pl 123 El Guapö=_ this prints

Input:     123 El Guapö=_
Processed:  El Guapö
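
To see why the decoding step matters, here is a minimal sketch. It assumes the argument arrives as raw UTF-8 bytes, as it would in @ARGV; keep_letters is a hypothetical helper name, not part of the code above. On the raw bytes, ö is two separate bytes that the regex judges one at a time, so only part of it survives. On the decoded string it is a single letter.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: strip everything except letters and spaces
sub keep_letters {
    my ($text) = @_;
    $text =~ s/[^\p{L} ]//g;
    return $text;
}

# "El Guapö" as raw UTF-8 bytes (ö is the two bytes C3 B6)
my $bytes = "El Guap\xC3\xB6";
my $chars = decode_utf8($bytes);    # now ö is one character, U+00F6

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

my $good = keep_letters($chars);    # ö is a letter and is kept
my $bad  = keep_letters($bytes);    # byte B6 is dropped, byte C3 stays: mangled

# Encode only when the text leaves the program
print encode_utf8($good), "\n";
```

The last line encodes by hand only because this sketch does not use the open pragma; with use open ':std', ':encoding(UTF-8)' as in the code above, the output layer does that for you.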

I've used the "blanket" \p{L} property (Letter) because the question doesn't say exactly which characters should be kept; adjust as needed. The Unicode properties provide a lot: see the link above and the complete list at perluniprops.

The space that separated 123 from El remains, now as a leading space. You may want to strip leading (and trailing) spaces at the end.
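
One way to do that clean-up, as a sketch only: clean_name is a hypothetical helper, and collapsing runs of spaces is an extra assumption about what you want. The ö is written as \x{F6} so the snippet doesn't depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: keep letters and spaces, then tidy the whitespace
sub clean_name {
    my ($text) = @_;
    $text =~ s/[^\p{L} ]//g;    # same substitution as above
    $text =~ s/ {2,}/ /g;       # collapse runs of spaces left by removed tokens
    $text =~ s/^ +| +\z//g;     # trim leading and trailing spaces
    return $text;
}

my $cleaned = clean_name("123 El Guap\x{F6}=_");   # "El Guapö", no leading space
```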

Note that there is also \P{L}, where the capital P indicates negation.
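
A small illustration of that negation (the sample string is made up for the example). When nothing else needs to be kept, a bare \P{L} does the same job as a negated class around \p{L}. Once you also want to keep spaces, you are back to the bracketed form used above.

```perl
use strict;
use warnings;

my $text = "ni\x{F1}o 42!";    # \x{F1} is ñ

(my $with_class = $text) =~ s/[^\p{L}]//g;   # negated class around \p{L}
(my $with_neg   = $text) =~ s/\P{L}//g;      # capital P: "not a letter"

# Both remove the space, the digits and the punctuation
print "same result\n" if $with_class eq $with_neg;
```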


The simple-minded \pL above won't work with Combining Diacritical Marks: a combining mark is not a letter, so it gets removed as well. Thanks to jm666 for pointing this out.

This happens when an accented "logical" character (an extended grapheme cluster, what appears as a single character) is written using separate characters for its base and for its non-spacing mark(s) (combining accents). Often a single precomposed character with its own code point also exists.

Example: in niño the ñ is U+00F1, but it can also be written as "n\x{303}".
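
The difference between the two spellings can be checked directly. This sketch uses the core Unicode::Normalize module only to show that they are the same text once composed; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "ni\x{F1}o";      # ñ as the single code point U+00F1
my $combining   = "nin\x{303}o";    # n followed by U+0303 COMBINING TILDE

# Different code point counts, so the strings are not equal as they stand
printf "code points: %d vs %d\n", length($precomposed), length($combining);
print "not equal as written\n" if $precomposed ne $combining;

# Composing the second form (NFC) gives the first one
print "equal after NFC\n" if NFC($combining) eq $precomposed;

# \X matches one extended grapheme cluster, i.e. one "logical" character
my @clusters = $combining =~ /\X/g;
printf "grapheme clusters: %d\n", scalar @clusters;
```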

To keep accents written this way, add \p{Mn} (\p{NonspacingMark}) to the character class:

my $string = "El Guapö=_ ni\N{U+00F1}o.* nin\x{303}o+^";
say $string;

(my $nodiac = $string) =~ s/[^\pL ]//g;      # naive: combining accents get removed
say $nodiac;

(my $full = $string) =~ s/[^\pL\p{Mn} ]//g;  # add non-spacing mark
say $full;

Output

El Guapö=_ niño.* niño+^
El Guapö niño nino
El Guapö niño niño

So you want s/[^\p{L}\p{Mn} ]//g in order to keep the combining accents.

zdim
  • @jm666 Thank you for the comment. I wasn't very concerned with the exact regex, since the OP doesn't say much -- and I thought that the rest is really more important. You are right, need to throw in `\pM` into the character class ... will add, with an example. – zdim May 04 '17 at 09:05
  • I somewhat understand. My ultimate purpose is to untaint CGI input, store in MySQL, then retrieve and use in HTML. My confusion lies in decode/encode. Is it proper to store decoded value in database and encode before use? I need to properly work with the wacky stuff customers enter that I currently strip out. Thanks! – Timothy B. May 04 '17 at 13:55
  • @TimothyB. You've got it backwards. You need to *en*code before storing it in the database and *de*code when you pull it back out again. If you're using DBI and your database and database handle are set up properly, this is done for you. – Matt Jacob May 04 '17 at 14:43
  • Thanks! Here's what I ended up with. DBI connection uses mysql_enable_utf8. Decode untainted param() input. Write to database. Read from database. Encode and this displays properly in HTML. Bonus: I can search MySQL for "nino" (plain text) and it matches. I have this up and running. I mention it here in case I'm still doing something wrong. – Timothy B. May 04 '17 at 15:40
  • @TimothyB. Since it's a DB, a lot gets done for us, as Matt says, but keep your eyes open. In your program you want to decode everything that comes in, for processing, and then encode it for output. Make sure that your HTML specifies encodings as well. What you say sounds good. – zdim May 04 '17 at 18:28
  • @TimothyB. I forgot to say -- I've updated the post, with more suitable references, and with needed `\p{Mn}` (non-spacing mark) instead of `\p{M}` (mark). – zdim May 05 '17 at 06:15