For the direct question, you may simply need the \p{L} (Letter) Unicode character property.
However, more importantly, decode all input and encode output.
use warnings;
use strict;
use feature 'say';
use utf8; # allow non-ascii (UTF-8) characters in the source
use open ':std', ':encoding(UTF-8)'; # for standard streams
use Encode qw(decode_utf8); # @ARGV escapes the above
my $string = 'El Guapö';
if (@ARGV) {
    $string = join ' ', map { decode_utf8($_) } @ARGV;
}
say "Input: $string";
$string =~ s/[^\p{L} ]//g;
say "Processed: $string";
When run as script.pl 123 El Guapö=_ this prints
Input: 123 El Guapö=_
Processed:  El Guapö
I've used the "blanket" \p{L} property (Letter), as a specific description is lacking; adjust as needed. Unicode properties offer a lot; see the link above and the complete list at perluniprops.
The space between 123 and El remains, so the result starts with a space; you may want to strip leading (and trailing) spaces at the end.
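As a minimal sketch of that last step, the leftover spaces can be trimmed with one more substitution after the filtering. The sample string is hard-coded here for illustration, and using \s (any whitespace) rather than a literal space is my assumption about what is wanted.

```perl
use warnings;
use strict;
use feature 'say';
use utf8;
use open ':std', ':encoding(UTF-8)';

my $string = '123 El Guapö=_';

$string =~ s/[^\p{L} ]//g;    # keep letters and spaces; leaves ' El Guapö'
$string =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace

say "[$string]";
```

The brackets in the output are only there to make any remaining edge whitespace visible.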
Note that there is also \P{L}, where the capital P indicates negation.
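A short sketch of the equivalence, with hypothetical variable names: outside a bracketed class, \P{L} matches the same characters as [^\p{L}]. Note that this form also removes the spaces, so when the space must be kept the negated class [^\p{L} ] is still needed.

```perl
use warnings;
use strict;
use feature 'say';
use utf8;
use open ':std', ':encoding(UTF-8)';

my $string = 'El Guapö=_';

# Negated bracketed class: remove everything that is not a letter
(my $negated_class = $string) =~ s/[^\p{L}]//g;

# Capital P: the same match written with the negated property
(my $capital_p = $string) =~ s/\P{L}//g;

say $negated_class;
say $capital_p;
```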
The above simple-minded \pL won't work with Combining Diacritical Marks, as the mark will be removed as well. Thanks to jm666 for pointing this out.
This happens when an accented "logical" character (an extended grapheme cluster, which appears as a single character) is written using separate characters for its base and for its non-spacing mark(s) (combining accents). Often a single precomposed character, with its own code point, also exists for it.
Example: in niño the ñ is U+00F1, but it can also be written as "n\x{303}".
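To make the difference between the two spellings concrete, here is a small sketch that compares them. It uses the core Unicode::Normalize module only to show that the two forms are canonically equivalent; that module is my addition for illustration, not part of the approach in this answer.

```perl
use warnings;
use strict;
use Unicode::Normalize qw(NFC);

my $precomposed = "ni\N{U+00F1}o";    # ñ as a single code point
my $decomposed  = "nin\x{303}o";      # n followed by a combining tilde

# The strings differ in code points, and so compare as unequal
my $same_raw = $precomposed eq $decomposed ? 1 : 0;

# \X matches one extended grapheme cluster, i.e. one "logical" character
my $clusters = () = $decomposed =~ /\X/g;

# After composing (NFC) the decomposed form equals the precomposed one
my $same_nfc = NFC($decomposed) eq $precomposed ? 1 : 0;

printf "code points: %d vs %d\n", length $precomposed, length $decomposed;
printf "clusters in decomposed form: %d\n", $clusters;
printf "equal as written: %d, equal after NFC: %d\n", $same_raw, $same_nfc;
```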
To keep accents written this way, add \p{Mn} (\p{NonspacingMark}) to the character class:
my $string = "El Guapö=_ ni\N{U+00F1}o.* nin\x{303}o+^";
say $string;
(my $nodiac = $string) =~ s/[^\pL ]//g; # naive: combining accents get removed
say $nodiac;
(my $full = $string) =~ s/[^\pL\p{Mn} ]//g; # add non-spacing mark
say $full;
Output
El Guapö=_ niño.* niño+^
El Guapö niño nino
El Guapö niño niño
So you want s/[^\p{L}\p{Mn} ]//g in order to keep the combining accents.