Reading a UTF-8 encoded file after a seek, as in `open(FILE, '<:utf8', $file) or die; seek(FILE, $readFrom, 0); read(FILE, $_, $size);`, sometimes "breaks up" a Unicode character, so the beginning of the string read is not valid UTF-8.

If you then do e.g. `s{^([^\n]*\r?\n)}{}i` to strip the incomplete first line, you get "Malformed UTF-8 character (fatal)" errors.

How to fix this?

One solution, listed in "How do I sanitize invalid UTF-8 in Perl?", is to remove all invalid UTF-8 chars:

tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;

However, scanning the entire string seems like overkill, since only the first byte(s) of the string read can be broken.

Can anyone suggest a way to strip only an initial invalid char (or make the above substitution not die on malformed UTF-8)?

W3Coder
  • apply your `tr` to only the first character? – Sobrique Nov 02 '15 at 10:25
  • This shouldn't happen, see `perldoc -f read`: *Note the characters: ...By default all filehandles operate on bytes, but...if the filehandle has been opened with the ":utf8" I/O layer the I/O will operate on UTF-8 encoded Unicode characters, not bytes.* Please give a minimal example of this happening. – Vorsprung Nov 02 '15 at 10:32
  • That `tr` strips out at least 29 valid characters!!!! – ikegami Nov 02 '15 at 13:11

1 Answer

Read the stream as bytes, strip out partial characters at the start, determine where the last full character ends, then decode what's left.

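use Encode qw( STOP_AT_PARTIAL );
use Fcntl  qw( SEEK_SET );

my $encoding = Encode::find_encoding('UTF-8');

# Open without a decoding layer so seek() and read() count bytes.
open(my $FILE, '<:raw', $file) or die $!;
seek($FILE, $readFrom, SEEK_SET) or die $!;
my $bytes_read = read($FILE, my $buf, $size);
defined($bytes_read) or die $!;

# Drop continuation bytes of a character cut in half by the seek.
$buf =~ s/^[\x80-\xBF]+//;

# Decode the complete characters; a partial character at the end stays in $buf.
my $str = $encoding->decode($buf, STOP_AT_PARTIAL);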
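
To see what the strip-and-decode step does without touching a file, here is a minimal sketch on an in-memory byte string. The sample text and the `substr` offsets are made up for illustration; they stand in for a seek that lands inside one character and a read that ends inside another.

```perl
use strict;
use warnings;
use Encode qw( STOP_AT_PARTIAL );

my $encoding = Encode::find_encoding('UTF-8');

# Illustrative data: "é" encodes as C3 A9 and "€" as E2 82 AC,
# so the full byte string is C3 A9 61 62 63 E2 82 AC.
my $bytes = Encode::encode('UTF-8', "\x{E9}abc\x{20AC}");

# Simulate a seek into the middle of "é" and a read that stops inside "€".
my $buf = substr($bytes, 1, 5);   # A9 61 62 63 E2

# Continuation bytes have the form 10xxxxxx (0x80-0xBF); any of them at the
# very start belong to the character the seek landed in, so drop them.
$buf =~ s/^[\x80-\xBF]+//;

# Decode the complete characters. With STOP_AT_PARTIAL the unfinished
# trailing sequence is left in $buf instead of being treated as an error.
my $str = $encoding->decode($buf, STOP_AT_PARTIAL);

printf "decoded %d character(s), %d byte(s) left in the buffer\n",
    length($str), length($buf);
```

The leftover byte in `$buf` is the lead byte of the truncated "€"; it only becomes decodable once the following bytes have been read.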

If you want to read more, use the 4-arg form of read so that the new bytes are appended after any partial character still sitting in $buf, and don't skip anything at the start this time.

# Append after whatever partial character is still in $buf.
$bytes_read = read($FILE, $buf, $size, length($buf));
defined($bytes_read) or die $!;

$str .= $encoding->decode($buf, STOP_AT_PARTIAL);
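
Putting the two steps together, the following is one possible sketch of a complete read loop, not a definitive implementation. The temporary file, its contents, the tiny chunk size and the `$first` flag are assumptions chosen so that chunk boundaries fall inside multi-byte characters.

```perl
use strict;
use warnings;
use Encode     qw( STOP_AT_PARTIAL );
use Fcntl      qw( SEEK_SET );
use File::Temp qw( tempfile );

# Illustrative test data: "é€" repeated three times, written as UTF-8 bytes.
my $text = "\x{E9}\x{20AC}" x 3;
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} Encode::encode('UTF-8', $text);
close($tmp_fh) or die $!;

my $encoding = Encode::find_encoding('UTF-8');
my ($readFrom, $size) = (1, 4);   # start inside "é", read 4 bytes at a time

open(my $FILE, '<:raw', $file) or die $!;
seek($FILE, $readFrom, SEEK_SET) or die $!;

my $buf   = '';
my $str   = '';
my $first = 1;
while (1) {
    # Append after any partial character left over from the previous pass.
    my $bytes_read = read($FILE, $buf, $size, length($buf));
    defined($bytes_read) or die $!;
    last if !$bytes_read;          # end of file

    # Only the first chunk can start in the middle of a character.
    if ($first) {
        $buf =~ s/^[\x80-\xBF]+//;
        $first = 0;
    }

    # Decode complete characters; an unfinished tail stays in $buf.
    $str .= $encoding->decode($buf, STOP_AT_PARTIAL);
}
close($FILE) or die $!;

printf "decoded %d character(s)\n", length($str);
```

If `$buf` is still non-empty after the loop, the file itself ended in the middle of a character; how to treat that case is left to the caller.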

Related reading: Convert UTF-8 byte stream to Unicode

ikegami
  • `s/^\x80+//`? That's not how UTF-8 works. Continuation bytes can be stripped with `s/^[\x80-\xBF]+//`. – nwellnhof Nov 02 '15 at 15:30
  • @nwellnhof, ack! of course! (I knew that, as evidenced by my Answer to the linked question.) Fixed. – ikegami Nov 02 '15 at 18:12