Comparing two non-ASCII strings in Perl

Question

I am unable to compare two non-ASCII strings, although both strings look the same on the console. Below is what I tried. Please let me know what code is missing so that the two variables compare as equal.

if ($lineContent[7] ne $name) {
  # Control reaches here
  print "###### Values MIS-MATCHED\n";
} else {
  print "###### Values MATCHED\n";
}

$lineContent[7] is from a CSV file

$name is from an XML file

When PuTTY's console uses the default character set

CSV Val: ENB69-åºå°å±
XML Val: ENB69-åºå°å±

When PuTTY's console is set to UTF-8

CSV Val: ENB69-基地局
XML Val: ENB69-基地局

Your syntax is correct. Please provide the output of the following: `use Data::Dumper; { local $Data::Dumper::Useqq = 1; local $Data::Dumper::Terse = 1; local $Data::Dumper::Indent = 0; print(Dumper($lineContent[7], $name)); }` — ikegami, Sep 05 '12 at 06:41
@ikegami, thank you for the interest in my post. Below is the result from Data::Dumper. "ENB13-\345\237\272\345\234\260\345\261\200" and "ENB13-\x{57fa}\x{5730}\x{5c40}" are the results for $lineContent[7] and $name respectively. — cms, Sep 05 '12 at 14:25

astonia · Answer 1 · 2012-09-05T15:11:43.513

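#!/usr/bin/perl

use warnings;
use strict;
use Encode;

binmode STDOUT, ":encoding(UTF-8)";

# Open the same file twice: once with a decoding layer, once as raw bytes.
open my $f1, "<:encoding(UTF-8)", $ARGV[0] or die "$!";
open my $f2, "<", $ARGV[0] or die "$!";

my $a1 = <$f1>;   # decoded characters
chomp $a1;
my $a2 = <$f2>;   # raw UTF-8 bytes
chomp $a2;

# Characters compared with undecoded bytes: false.
if ($a1 eq $a2) {
    print "$a1=$a2 is true\n";
} else {
    print "$a1=$a2 is false\n";
}

# Decode the bytes first, then compare: true.
my $decoded = decode("UTF-8", $a2);
if ($a1 eq $decoded) {
    print "$a1=$decoded is true\n";
} else {
    print "$a1=$decoded is false\n";
}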

I wrote the test program listed above and created a text file with one line: 基地局. When you run the program with this text file as its argument, you get one false and one true. I don't know what's in your program, but I guess the CSV file is read as plain text without any parser or decoding step, whereas the XML file is parsed by a library that decodes it, so the two string variables hold different internal representations of the same text. Simply put, try encoding or decoding one of the two string variables and see if they match.

By the way, this is my first answer here; I hope it is a little bit helpful to you ;-)

From your dump results, it's obvious. After the "ENB13-" prefix, the first variable stores 9 characters, each of which is one byte of the UTF-8 encoding of 基地局. The second variable stores 3 characters, one per code point. Encoded as UTF-8 they produce the same byte stream, so they are equal in a byte-stream view but not equal in a character-based comparison.

Using decode/encode can solve your problem.
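
A minimal sketch of that idea, assuming the CSV value arrives as raw UTF-8 bytes and the XML value as decoded characters, as the dump output suggests. The variable names are illustrative and not taken from the original program.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed inputs, copied from the Data::Dumper output in the question.
my $csv_bytes = "ENB13-\345\237\272\345\234\260\345\261\200";   # UTF-8 bytes
my $xml_chars = "ENB13-\x{57fa}\x{5730}\x{5c40}";               # decoded characters

# Same text, different representation: the lengths differ.
printf "lengths: %d vs %d\n", length($csv_bytes), length($xml_chars);

# Comparing as-is fails because eq compares the stored characters one by one.
print "raw:     ", ($csv_bytes eq $xml_chars ? "equal" : "not equal"), "\n";

# Option 1: decode the bytes into characters, then compare.
my $csv_chars = decode("UTF-8", $csv_bytes);
print "decoded: ", ($csv_chars eq $xml_chars ? "equal" : "not equal"), "\n";

# Option 2: encode the characters into bytes, then compare.
my $xml_bytes = encode("UTF-8", $xml_chars);
print "encoded: ", ($csv_bytes eq $xml_bytes ? "equal" : "not equal"), "\n";
```

Either direction works for a one-off comparison. Decoding is usually the better habit, since the rest of the program can then treat every string as characters.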

In Perl, what you see might not be what the program stores internally. For example, Java uses Unicode as the internal structure for strings, but in Perl the same text can be held as different sequences, as your problem shows. encode/decode can do the conversion, and split as in @ikegami's solution may help. pack/unpack is another way to solve this problem, yet I don't think it is the best choice. — astonia, Sep 06 '12 at 16:53

score 1 · Answer 2 · edited May 23 '17 at 10:34

Personally, I would be a little more careful if you know that you are comparing Unicode strings. Unicode::Collate is the module for the job.

Of course you should also read tchrist's now-famous SO post on the topic of enabling Unicode in Perl, https://stackoverflow.com/a/6163129/468327, but utf8::all does an admirable job of turning on proper Unicode support. Note that better Unicode handling was added to the Perl core in version 5.14, so I require that here as well.

Finally, here is a quick script that does the comparison. Of course, you would populate the variables by reading the files as needed:

#!/usr/bin/env perl

use v5.14;
use strict;
use warnings;

use utf8::all;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

my $csv = "ENB69-基地局";
my $xml = "ENB69-基地局";

say $collator->eq($csv, $xml) ? "equal" : "unequal";

Thanks for your response. Since I could not upgrade to 5.14, I skipped trying this solution. — cms, Sep 06 '12 at 07:20

score 1 · Accepted Answer · answered Sep 05 '12 at 15:31

Your inputs:

"ENB13-\345\237\272\345\234\260\345\261\200"
"ENB13-\x{57fa}\x{5730}\x{5c40}"

As you can see, these are clearly not the same. Specifically, the first is the UTF-8 encoding of the other. Always decode inputs. Always encode outputs.

use strict;
use warnings;

use utf8;                             # Source code is saved as UTF-8
use open ':std', ':encoding(UTF-8)';  # Terminal expects UTF-8

my $name = "ENB69-基地局";

while (my $line = <STDIN>) {
   chomp $line;
   my @lineContent = split /\t/, $line;
   print($lineContent[7] eq $name ? 1 : 0, "\n");  # 1
}

ikegami

It worked perfectly. Thanks for the help. BTW, could you explain the concept underlying your approach? In my experience, what we see in a text or XML file is what the program reads. – cms Sep 06 '12 at 07:20
The driving concept is "Always decode inputs. Always encode outputs." If your database gives you encoded text, decode it. If your XML parser gives you encoded text, decode it, etc. – ikegami Sep 06 '12 at 14:34
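
A rough sketch of "decode inputs, encode outputs" at the I/O boundary, using in-memory filehandles so it runs without any files. The sample line, the two-column layout, and the variable names are assumptions for illustration only; a real program would open the CSV file with the same `:encoding(UTF-8)` layer.

```perl
use strict;
use warnings;

# Hypothetical input: one tab-separated line stored as UTF-8 bytes,
# standing in for a line of the CSV file.
my $file_bytes = "id\tENB69-\345\237\272\345\234\260\345\261\200\n";

# Decode on the way in: the :encoding layer turns bytes into characters.
open my $in, '<:encoding(UTF-8)', \$file_bytes or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;
my @fields = split /\t/, $line;

# Inside the program, work with characters only.
my $name = "ENB69-\x{57fa}\x{5730}\x{5c40}";
my $matched = $fields[1] eq $name ? 1 : 0;

# Encode on the way out: the :encoding layer turns characters into bytes.
my $out_bytes = '';
open my $out, '>:encoding(UTF-8)', \$out_bytes or die "open failed: $!";
print {$out} "$fields[1]\n";
close $out;

# ASCII-only report, so no output layer is needed on STDOUT here.
printf "matched=%d, characters=%d, bytes written=%d\n",
    $matched, length($fields[1]), length($out_bytes);
```

With layers on every handle, the comparison itself needs no special handling, because both sides are already character strings by the time they meet.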
