9

I am trying to improve the warning message issued by Encode::decode(). Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the origial message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather like to know which file is giving the problem.

I first tried to solve this using a $SIG{__WARN__} handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:

#! /usr/bin/env perl

use feature qw(say);
use strict;
use warnings;

use Encode ();

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $fn = 'test.txt';
write_test_file( $fn );

# Try to improve the Encode::FB_WARN fallback warning message :
#
#   utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
#   utf8 "\xE5" does not map to Unicode at line xx of file <filename>.

my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
    local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) }; 
    $str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";

sub my_warn_handler {
    my ( $fn, $msg ) = @_;

    if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
        recover_line_number_and_char_pos( $fn, $msg );
    }
    else {
        warn $msg;
    }
}

sub recover_line_number_and_char_pos {
    my ( $fn, $err_msg ) = @_;

    chomp $err_msg;
    $err_msg =~ s/(line \d+)\.$/$1/;  # Remove period at end of sentence.
    open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
    my $raw_data = do { local $/; <$fh> };
    close $fh;
    my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
    my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s; 
    my $line_no = $str =~ tr/\n//;
    ++$line_no;
    my $pos = ( length $last_line ) + 1;
    warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}

sub write_test_file {
    my ( $fn ) = @_;

    my $bytes = "Hello\nA\x{E5}\x{61}";  # 2 lines ending in iso 8859-1: åa
    open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
    print $fh $bytes;
    close $fh;
}

Output:

utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)
Håkon Hægland
  • 39,012
  • 21
  • 81
  • 174
  • Perhaps my_warn_handler goes into infinite recursion. Try putting `local $SIG{__WARN__};` inside my_warn_handler to restore the default behavior? – Waxrat Dec 28 '16 at 11:24
  • @Waxrat According to the [documentation](http://perldoc.perl.org/functions/warn.html), that should not happen: *"Most handlers must therefore arrange to actually display the warnings that they are not prepared to deal with, by calling warn again in the handler. Note that this is quite safe and will not produce an endless loop, since \__WARN\__ hooks are not called from inside one."* – Håkon Hægland Dec 28 '16 at 11:28
  • 5
    Think I found the error: I forgot to declare the `$fh` as lexical within the handler. So it reopens the lexical `$fh` defined in the enclosing scope and later closes that handle. So this is probably causing some confusion to `Encode::decode()`.. – Håkon Hægland Dec 28 '16 at 11:47

1 Answers1

1

Here is another way to locate where the warning fires, with un-buffered sysread

use warnings;
use strict;

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";

$SIG{__WARN__} = sub { print "\t==> WARN: @_" };

my $char_cnt = 0;    
my $char;

while (sysread($fh, $char, 1)) {
    ++$char_cnt;
    print "$char ($char_cnt)\n";
}

The file test.txt was written by the posted program, except that I had to add to it to reproduce the behavior -- it runs without warnings on v5.10 and v5.16. I added \x{234234} to the end. The line number can be tracked with $char =~ /\n/.

The sysread returns undef on error. It can be moved into the body of while (1) to allow reads to continue and catch all warnings, breaking out on 0 (returned on EOF).

This prints

H (1)
e (2)
l (3)
l (4)
o (5)

 (6)
A (7)
å (8)
a (9)
        ==> WARN: Code point 0x234234 is not Unicode, may not be portable at ...
 (10)

While this does catch the character warned about, re-reading the file using Encode may well be better than reaching for sysread, in particular if sysread uses Encode.

However, Perl is utf8 internally and I am not sure that sysread needs Encode.

Note. The page for sysread supports its use on data with encoding layers

Note that if the filehandle has been marked as :utf8 Unicode characters are read instead of bytes (the LENGTH, OFFSET, and the return value of sysread are in Unicode characters). The :encoding(...) layer implicitly introduces the :utf8 layer. See binmode, open, and the open pragma.


Note   Apparently, things have moved on and after a certain version sysread does not support encoding layers. The link above, while for an older version (v5.10 for one) indeed shows what is quoted, with a newer version tells us that there'll be an exception.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • Interesting. I have not checked the source yet, but according to [PerlIO::encoding](http://perldoc.perl.org/PerlIO/encoding.html) it seems that `readline` will call `Encode::decode()`, so the warning should come from `Encode` then? – Håkon Hægland Dec 29 '16 at 10:58
  • @HåkonHægland The `sysread` uses `read(2)`, it's about the lowest level. The docs do say that it reads unicode if the filehandle is "marked" `:utf8` but I don't know that it would use `Encode` ...? May remove the first sentence until I clear it up. My main point was to locate the warning by 'direct' reading. – zdim Dec 29 '16 at 12:02
  • @HåkonHægland This is indeed tricky and I am not sure whether `sysread` needs/uses `Encode`. I adjusted the answer accordingly. – zdim Dec 29 '16 at 12:25
  • Using `sysread` on a handle with encoding layers sounds like a bad idea. – melpomene Dec 29 '16 at 15:23
  • @melpomene The page for `sysread` specifically supports it, in the closing paragraph. – zdim Dec 29 '16 at 20:04
  • @zdim I suspect that's more of an implementation accident than an intentional feature. It's also unclear: For example, `:encoding(latin-9)` shouldn't really be doing anything with Unicode. – melpomene Dec 29 '16 at 20:17
  • @melpomene May be. However, internally it is all utf8 so low level reads need to support that. I was trying to locate the warning and didn't think of `sysread` resorting to `Encode`, and I don't know that it does. If the warning does come from reading this directly than I'd rather catch it there. – zdim Dec 29 '16 at 20:23