I have a string like the one below:

stringinput = Sweééééôden@

I want to get output like this:

stringoutput = Sweden

The special characters ééééô and @ have to be removed.

I am using

$stringoutput = `echo $stringinput | sed 's/[^a-z  A-Z 0-9]//g'`;

The result I get is Sweééééôden: the @ is removed, but ééééô is not.

Can you please suggest what I have to add?

Vadim Kotov
sravani srinija

2 Answers

There is no need to call sed from Perl; Perl can do the substitution itself. It is also faster, since no new process has to be started.

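#!/usr/bin/perl
use warnings;
use strict;
use utf8;    # the source file contains UTF-8 string literals

my $string = 'Sweééééôden@';
# Delete every character that is not an ASCII letter or digit.
$string =~ s/[^A-Za-z0-9]//g;
print $string;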
choroba

You need to set LC_ALL=C before the sed command so that the ranges in the [A-Za-z] bracket expression follow the ASCII table:

stringoutput=$(echo "$stringinput" | LC_ALL=C sed 's/[^A-Za-z0-9]//g')

See the online demo:

stringinput='Sweééééôden@'
stringoutput=$(echo "$stringinput" | LC_ALL=C sed 's/[^A-Za-z0-9]//g')
echo "$stringoutput"
# => Sweden

See POSIX regex reference:

In the default C locale, the sorting sequence is the native character order; for example, ‘[a-d]’ is equivalent to ‘[abcd]’. In other locales, the sorting sequence is not specified, and ‘[a-d]’ might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to match any character, or the set of characters that it matches might even be erratic. To obtain the traditional interpretation of bracket expressions, you can use the ‘C’ locale by setting the LC_ALL environment variable to the value ‘C’.

In Perl, you could simply use

my $stringinput = 'Sweééééôden@';
my $stringoutput = $stringinput =~ s/[^A-Za-z0-9]+//gr;
print $stringoutput;

See this online demo.
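
If the bracket expression has to keep extra characters such as `/` or `-`, the delimiter and the hyphen need care: an unescaped `/` ends a `s/.../.../` pattern early, and a `-` between two characters is read as a range. The following is only a sketch of one way to avoid both problems, using brace delimiters and placing `-` last; the sample address comes from the comment thread, and the set of characters to keep is an assumption that should be adjusted to the real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the source file contains UTF-8 string literals

# Sample input taken from the comment thread.
my $stringinput = 'PO BOX 29794/MACééééô S3929-033';

# s{...}{} uses braces as delimiters, so "/" is an ordinary character here.
# "-" is placed last in the class, so it is a literal hyphen, not a range.
my $stringoutput = $stringinput =~ s{[^A-Za-z0-9 .,/-]+}{}gr;

print "$stringoutput\n";    # PO BOX 29794/MAC S3929-033
```

The `/r` modifier requires Perl 5.14 or later; on older versions, copy the string first and apply the substitution to the copy.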

Wiktor Stribiżew
  • Thanks a lot. `my $stringoutput = $stringinput =~ s/[^A-Za-z0-9]+//gr;` worked. But when I added `/` to the regex (since I don't want the `/` occurrence from a stringinput like `PO BOX 29794/MACééééô S3929-033`), it threw the error below: `Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE ^A-Za-z0-9 / at splchar.pl line 11.` I used the regex below: `$stringout = $stringin =~ s/[^A-Za-z0-9 /-.,]+//gr;` – sravani srinija Mar 03 '21 at 08:29
  • @sravanisrinija You used `-` inside the brackets, right? Escape it (or place at the end of the bracket expression). Also, escape `/`. `my $stringoutput = $stringinput =~ s/[^A-Za-z0-9 .,\/-]+//gr;` – Wiktor Stribiżew Mar 03 '21 at 09:27
  • Note: Both assume an NFC string and fail for NFD strings (producing Sweeeeeoden instead of Sweden). See [Unicode Equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence) – ikegami Mar 04 '21 at 21:13
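
The NFC/NFD point in the last comment can be illustrated with the core Unicode::Normalize module. This is a minimal sketch, not part of either answer: the variable names are made up for the example, and the NFD input is built explicitly so the result does not depend on how the source file was saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Build the NFD form explicitly: each accented letter becomes a base
# letter followed by a combining mark (for example "e" + U+0301).
my $nfd = NFD("Swe\x{E9}\x{E9}\x{E9}\x{E9}\x{F4}den\@");

# Without normalization only the combining marks and "@" are stripped,
# so the base letters stay behind.
my $naive = $nfd =~ s/[^A-Za-z0-9]+//gr;    # Sweeeeeoden

# Recomposing first turns each base+mark pair back into a single
# non-ASCII character, which the class then removes as a whole.
my $nfc   = NFC($nfd);
my $clean = $nfc =~ s/[^A-Za-z0-9]+//gr;    # Sweden

print "$naive\n$clean\n";
```

Normalizing to NFC only helps when a precomposed character exists for the base letter and its mark; a combination without one would still leave its base letter in the output.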