A solution that works with streams:
use constant READ_SIZE => 64*1024;

# $fh is a handle opened for raw bytes; handle() receives the raw bytes
# and the decoded character.
my $buf = '';
while (1) {
    # Append the next chunk after any incomplete sequence left in the buffer.
    my $rv = sysread($fh, $buf, READ_SIZE, length($buf));
    die("Read error: $!\n") if !defined($rv);
    last if !$rv;

    while (length($buf)) {
        if ($buf =~ s/
            ^
            ( [\x00-\x7F]
            | [\xC2-\xDF] [\x80-\xBF]
            | \xE0        [\xA0-\xBF] [\x80-\xBF]
            | [\xE1-\xEF] [\x80-\xBF] [\x80-\xBF]
            | \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
            | [\xF1-\xF7] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
            )
        //x) {
            # Something valid
            my $utf8 = $1;
            utf8::decode( my $ucp = $utf8 );
            handle($utf8, $ucp);
        }
        elsif ($buf =~ /
            ^
            (?: [\xC2-\xDF]
            |   \xE0        [\xA0-\xBF]?
            |   [\xE1-\xEF] [\x80-\xBF]?
            |   \xF0        (?: [\x90-\xBF] [\x80-\xBF]? )?
            |   [\xF1-\xF7] (?: [\x80-\xBF] [\x80-\xBF]? )?
            )
            \z
        /x) {
            # Something possibly valid: leave it in the buffer until more bytes arrive
            last;
        }
        else {
            # Something invalid
            handle(substr($buf, 0, 1, ''), "\x{FFFD}");
        }
    }
}

# End of stream: whatever is still buffered is an incomplete sequence.
while (length($buf)) {
    handle(substr($buf, 0, 1, ''), "\x{FFFD}");
}
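To see the chunk-boundary handling in isolation, here is a minimal sketch that replaces sysread with two hard-coded chunks, so that the three bytes of U+20AC are split across reads. The helper name drain_buffer, the @seen list, and the chunk contents are made up for this illustration; the two patterns are the ones used in the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: consume as much of $$buf_ref as can be decided now.
# Returns [raw bytes, decoded character] pairs and leaves an incomplete
# trailing sequence in the buffer.
sub drain_buffer {
    my ($buf_ref) = @_;
    my @out;
    while (length($$buf_ref)) {
        if ($$buf_ref =~ s/
            ^
            ( [\x00-\x7F]
            | [\xC2-\xDF] [\x80-\xBF]
            | \xE0        [\xA0-\xBF] [\x80-\xBF]
            | [\xE1-\xEF] [\x80-\xBF] [\x80-\xBF]
            | \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
            | [\xF1-\xF7] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
            )
        //x) {
            my $utf8 = $1;
            utf8::decode( my $ucp = $utf8 );
            push @out, [ $utf8, $ucp ];
        }
        elsif ($$buf_ref =~ /
            ^
            (?: [\xC2-\xDF]
            |   \xE0        [\xA0-\xBF]?
            |   [\xE1-\xEF] [\x80-\xBF]?
            |   \xF0        (?: [\x90-\xBF] [\x80-\xBF]? )?
            |   [\xF1-\xF7] (?: [\x80-\xBF] [\x80-\xBF]? )?
            )
            \z
        /x) {
            last;    # possibly valid: wait for the next chunk
        }
        else {
            push @out, [ substr($$buf_ref, 0, 1, ''), "\x{FFFD}" ];
        }
    }
    return @out;
}

my $buf = '';
my @seen;
for my $chunk ("a\xE2\x82", "\xACb") {    # U+20AC split across two chunks
    $buf .= $chunk;
    push @seen, drain_buffer(\$buf);
}

printf "%s -> U+%04X\n", unpack('H*', $_->[0]), ord($_->[1]) for @seen;
# 61 -> U+0061
# e282ac -> U+20AC
# 62 -> U+0062
```

After the first chunk only "a" is emitted; the two leftover bytes stay in the buffer and are completed by the second chunk.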
The above only returns U+FFFD for what Encode::decode('UTF-8', $bytes)
considers ill-formed. In other words, it only returns U+FFFD when it encounters one of the following:
- An unexpected continuation byte.
- A start byte not followed by enough continuation bytes.
- The first byte of an "overlong" encoding.
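As a rough illustration of the three cases, the sketch below tests a few sample byte strings against the "complete sequence" pattern from the loop. The helper starts_well_formed and the sample inputs are assumptions chosen for the example, not part of the original code.

```perl
use strict;
use warnings;

# Same "complete sequence" pattern as in the loop, without the capture.
my $complete = qr/
    ^
    (?: [\x00-\x7F]
    |   [\xC2-\xDF] [\x80-\xBF]
    |   \xE0        [\xA0-\xBF] [\x80-\xBF]
    |   [\xE1-\xEF] [\x80-\xBF] [\x80-\xBF]
    |   \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
    |   [\xF1-\xF7] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
    )
/x;

# Hypothetical helper: 1 if $bytes begins with a sequence the loop accepts.
sub starts_well_formed {
    my ($bytes) = @_;
    return $bytes =~ $complete ? 1 : 0;
}

my @examples = (
    [ 'unexpected continuation byte'        => "\x80"         ],
    [ 'start byte, too few continuations'   => "\xE2\x82\x41" ],
    [ 'overlong two-byte encoding of "/"'   => "\xC0\xAF"     ],
    [ 'overlong three-byte encoding of "/"' => "\xE0\x80\xAF" ],
    [ 'well-formed U+20AC'                  => "\xE2\x82\xAC" ],
);

for my $example (@examples) {
    my ($label, $bytes) = @$example;
    printf "%-38s %s\n", $label,
        starts_well_formed($bytes) ? 'accepted' : 'first byte becomes U+FFFD';
}
```

Only the last sample is accepted; for the others the loop falls through to the branch that consumes one byte and reports U+FFFD.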
Post-decoding checks are still needed to return U+FFFD for what Encode::decode('UTF-8', $bytes)
considers otherwise illegal.
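A minimal sketch of such a post-decoding check follows. It assumes the "otherwise illegal" cases are UTF-16 surrogates (U+D800 to U+DFFF) and code points above U+10FFFF; that list is an assumption and should be checked against the Encode version in use. The helper name scrub_code_point is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: takes one decoded character and returns it unchanged,
# or U+FFFD if it falls in a range assumed to be rejected by strict UTF-8.
sub scrub_code_point {
    my ($ucp) = @_;
    my $cp = ord($ucp);
    return "\x{FFFD}" if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates
    return "\x{FFFD}" if $cp > 0x10FFFF;                     # beyond Unicode
    return $ucp;
}
```

In the "Something valid" branch this could be applied as handle($utf8, scrub_code_point($ucp)).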