I'm trying to read lines from the start of a file that has a text header encoded in cp1252, followed by binary data after a specific keyword.
Problem
Perl warns about invalid encoding in parts of the file I never read. I've created an example in two files to demonstrate the problem.
Contents of linebug.pl:
#!/usr/bin/perl
use 5.028;
use strict;
use warnings;
open( my $fh, "<:encoding(cp1252)", "testfile" ) or die "Cannot open testfile: $!";
while( <$fh> ) {
    print;
    last if /Last/;
}
Hexdump of testfile, where the byte 0x81 right after the text Wrong is purposefully added because it is not a valid cp1252 codepoint:
46 69 72 73 74 0a       |First.|
4c 61 73 74 0a          |Last.|
42 75 66 66 65 72 0a    |Buffer.|
57 72 6f 6e 67 81 0a    |Wrong..|
The third line, Buffer, is just there to make it clear that I do not read too far. It is a valid line between the last line I read and the "binary" data.
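For anyone who wants to reproduce this, a small script along these lines should recreate testfile byte for byte from the hexdump above. This is only a sketch: the script is not part of my actual setup, and it assumes the file is written in raw mode so that 0x81 lands on disk untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write the four lines exactly as in the hexdump (25 bytes in total).
# ":raw" keeps Perl from applying any encoding or newline translation.
open( my $out, ">:raw", "testfile" ) or die "Cannot create testfile: $!";

# \x81 is the deliberately invalid cp1252 byte after "Wrong".
print {$out} "First\nLast\nBuffer\nWrong\x81\n";

close($out) or die "Cannot close testfile: $!";
```

Running it once in the same directory as linebug.pl should be enough to set up the test case.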
Here is the output showing that I only ever read two lines, but Perl still emits a warning:
user@host$ perl linebug.pl
cp1252 "\x81" does not map to Unicode at ./linebug.pl line 6.
First
Last
user@host$
As can be seen, my program reads and prints the first two lines, and then exits. It should never try to read and interpret anything else, but I still get the warning about \x81 not mapping to Unicode.
Questions
- Why does it warn? I'm not reading the line. A hunch tells me it's trying to read ahead, but why would it try to decode?
- Is there a workaround, or a better way to handle files where the encoding changes from one section to another?
I still want the warning when reading the initial lines, in case the file is damaged.