
I am working on an Android word game with a large dictionary.

The words (over 700 000) are kept as separate lines in a text file (and then put in an SQLite database).

To protect my dictionary, I'd like to hash all words longer than 3 characters with MD5. (I don't obfuscate short words or words containing the rare Russian letters ъ and э, because I'd like to list them in my app.)

So here is my script which I try to run with perl v5.18.2 on Mac Yosemite:

#!/usr/bin/perl -w

use strict;
use utf8;
use Digest::MD5 qw(md5_hex);

binmode(STDIN, ":utf8");
#binmode(STDOUT, ":raw");
binmode(STDOUT, ":utf8");

while(<>) {
        chomp;
        next if length($_) < 2; # ignore 1 letter junk
        next if /жы/;           # impossible combination in Russian
        next if /шы/;           # impossible combination in Russian

        s/ё/е/g;
    
        if (length($_) <= 3 || /ъ/ || /э/) { # do not obfuscate short words
                print "$_\n";                # and words with rare letters
                next;
        }

        print md5_hex($_) . "\n";            # this line crashes
}

As you can see, I have to use cyrillic letters in the source code of my Perl script - that is why I've put use utf8; on its top.

However my real problem is that length($_) reports too high values (probably reporting number of bytes instead of number of characters).

So I have tried adding:

binmode(STDOUT, ":raw");

or:

binmode(STDOUT, ":utf8");

But the script then dies with "Wide character in subroutine entry" at the line with print md5_hex($_).

Please help me to fix my script.

I run it as:

perl ./generate-md5.pl < words.txt > encoded.txt

and here is example words.txt data for your convenience:

а
аб
абв
абвг
абвгд
съемка
Alexander Farber

4 Answers


md5_hex expects a string of bytes for input, but you're passing a decoded string (a string of Unicode Code Points). Explicitly encode the string.

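use strict;
use warnings;
use utf8;
use Digest::MD5;
use Encode;
# ....
# $_ holds decoded characters; encode it to UTF-8 bytes before hashing
print Digest::MD5::md5_hex(Encode::encode_utf8($_)), "\n";
# Alternative: encode only when Perl's internal UTF8 flag is set.
# The flag describes internal storage, so the unconditional form above is usually safer.
print Digest::MD5::md5_hex(utf8::is_utf8($_) ? Encode::encode_utf8($_) : $_), "\n";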
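
To make the difference concrete, here is a minimal self-contained sketch. The helper name md5_of_chars and the sample word are only illustrative assumptions and are not part of the original script. It shows the direct call failing on a decoded string and the encoded call succeeding:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # this source file contains Cyrillic literals
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: hash a decoded (character) string.
sub md5_of_chars {
    my ($chars) = @_;
    return md5_hex(encode_utf8($chars));    # characters -> UTF-8 bytes -> digest
}

my $word = "абвг";                      # 4 characters, 8 bytes once encoded as UTF-8

# Passing the decoded string directly dies, so trap the error with eval.
my $ok    = eval { md5_hex($word); 1 };
my $error = $ok ? '' : $@;

my $digest = md5_of_chars($word);

binmode(STDOUT, ":encoding(UTF-8)");
print "direct call failed: $error" unless $ok;
print "$word -> $digest\n";
```

Note that the digest is computed over the UTF-8 bytes, so whatever checks the hashes later (for example the Android app) would have to encode candidate words as UTF-8 too before hashing them.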
AnFi

my real problem is that length($_) reports too high values

Yes, you are reading from the ARGV file handle and haven't set its encoding to UTF-8.

You can use the open pragma to fix this. Instead of all your binmode statements, use

use open qw/ :std :encoding(utf8) /;

which will change the default open mode for all filehandles, including the standard ones, to :encoding(utf8).
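
The effect of a decoding layer on length can be seen in isolation with a small sketch. It uses an in-memory filehandle and explicit layers instead of the pragma so that it runs without any input file. The helper first_line_length and the sample data are assumptions made for illustration only:

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);

# UTF-8 bytes standing in for one line of words.txt
my $bytes = encode_utf8("съемка\n");

# Hypothetical helper: read the first line through the given layer
# and return its length as reported by length().
sub first_line_length {
    my ($layer) = @_;
    open(my $fh, "<$layer", \$bytes) or die "Cannot open in-memory file: $!";
    my $line = <$fh>;
    close($fh);
    chomp $line;
    return length($line);
}

my $raw_length     = first_line_length(':raw');              # counts bytes
my $decoded_length = first_line_length(':encoding(UTF-8)');  # counts characters

print "raw: $raw_length, decoded: $decoded_length\n";        # prints "raw: 12, decoded: 6"
```

Each Cyrillic letter occupies two bytes in UTF-8, which is why the undecoded line appears twice as long.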

Borodin

If you use Mojolicious, replacing to_json with encode_json will solve the problem.

From the JSON module's documentation for to_json: "If you want to write a modern perl code which communicates to outer world, you should use encode_json (supposed that JSON data are encoded in UTF-8)." And I can't foresee a non-UTF-8 world out there.
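
As a rough illustration of why this matters for md5_hex, the sketch below uses the core JSON::PP module. Here JSON::PP->new->encode stands in for to_json on the assumption that both return an unencoded character string. The sample data is hypothetical:

```perl
use strict;
use warnings;
use utf8;
use Digest::MD5 qw(md5_hex);
use JSON::PP;                                   # core module; exports encode_json

my $data = { word => "съемка" };                # hypothetical sample data

my $json_bytes = encode_json($data);            # UTF-8 encoded bytes
my $json_chars = JSON::PP->new->encode($data);  # character string, not encoded

my $digest   = md5_hex($json_bytes);            # works: the input is bytes
my $chars_ok = eval { md5_hex($json_chars); 1 };    # dies: wide characters

print "bytes: ", length($json_bytes), ", chars: ", length($json_chars), "\n";
print "digest of encoded JSON: $digest\n";
print "hashing the character string failed\n" unless $chars_ok;
```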

Tony Aziz

If you are using Perl version 5.0 or above, this can be resolved by changing to_json to encode_json.

Tony Aziz