
How do I deal with invalid UTF-8 sequences in data from an external file or external command, when that data is used to generate HTML (in a Perl web app)?

Currently I run to_utf8() on each piece of data; this subroutine detects whether the data is invalid UTF-8 and, if so, falls back to the 'latin1' encoding:

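use utf8;
use Encode;    # provides decode() and the Encode::FB_* constants
binmode STDOUT, ':utf8';

# Encoding to assume when the data is not valid UTF-8
our $fallback_encoding = 'latin1';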

sub to_utf8 {
    my $str = shift;
    return undef unless defined $str;
    if (utf8::valid($str)) {
        utf8::decode($str);
        return $str;
    } else {
        return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
    }
}

Please correct me if this code is incorrect.

A fragment of the setup recommended in Perl Unicode Essentials, from Tom Christiansen’s Materials for OSCON 2011, is

use utf8;
use open qw( :encoding(UTF-8) :std );

How can I get behavior similar to what I have now with a setup like the one above? I'd prefer automatic handling of Unicode to having to remember to wrap every string read from external commands and files in to_utf8().

The data comes from external files or from the output of external commands. It should be in UTF-8, but because of user errors it sometimes isn't.

Jakub Narębski
  • Maybe this answer provides some insight http://stackoverflow.com/questions/6234386/how-do-i-sanitize-invalid-utf-8-in-perl/6234455#6234455 – matthias krull Aug 12 '11 at 14:04

1 Answer


You can write a custom IO layer that does the "magical" decoding.

Usually IO layers (like :utf8) are written in XS, but the core module PerlIO::via (see http://search.cpan.org/perldoc?PerlIO::via) lets you write one in Perl code.
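
As a rough illustration of the idea, here is a minimal sketch of such a layer. The layer name Utf8OrFallback, the helper to_utf8_octets() and the slurp_lines() wrapper are all hypothetical names invented for this example, and the 'latin1' fallback is taken from the question. The sketch assumes line-oriented input: each line is checked as strict UTF-8, and any line that fails is reinterpreted in the fallback encoding and re-encoded, so the layer always hands valid UTF-8 upward.

```perl
use strict;
use warnings;

{
    package PerlIO::via::Utf8OrFallback;    # hypothetical layer name
    use Encode ();

    # Assumption: latin1 is the fallback, as in the question.
    our $fallback_encoding = 'latin1';

    # Called when the layer is pushed; must return an object.
    sub PUSHED {
        my ($class, $mode, $fh) = @_;
        return bless {}, $class;
    }

    # Tell PerlIO that this layer produces UTF-8, so readers get
    # character strings, as if ":utf8" had been stacked on top.
    sub UTF8 { 1 }

    # Return the octets unchanged if they are strict UTF-8; otherwise
    # treat them as the fallback encoding and re-encode as UTF-8.
    sub to_utf8_octets {
        my ($octets) = @_;
        my $ok = eval {
            Encode::decode('UTF-8', $octets,
                Encode::FB_CROAK() | Encode::LEAVE_SRC());
            1;
        };
        return $octets if $ok;
        return Encode::encode('UTF-8',
            Encode::decode($fallback_encoding, $octets));
    }

    # Called whenever the reader needs more data: pull one line from
    # the layer below and repair it. Returning undef signals EOF.
    sub FILL {
        my ($self, $fh) = @_;
        my $line = <$fh>;
        return defined $line ? to_utf8_octets($line) : undef;
    }
}

# Hypothetical usage: read a file through the layer.
sub slurp_lines {
    my ($path) = @_;
    open my $fh, '<:via(Utf8OrFallback)', $path
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}
```

The same layer string should also work for a pipe opened from an external command. Deciding line by line means that a file mixing UTF-8 and latin1 lines is repaired per line rather than as a whole, which may or may not be what you want.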

moritz