I'm trying to read lines from the first part of a file that starts with a text header encoded in cp1252 and contains binary data after a specific keyword.

Problem

Perl warns about invalid encoding in parts of the file I never read. I've created an example in two files to demonstrate the problem.

Contents of linebug.pl:

#!/usr/bin/perl
use 5.028;
use strict;
use warnings;
open( my $fh, "<:encoding(cp1252)", "testfile" );
while( <$fh> ) {
    print;
    last if /Last/;
}

Hexdump of testfile, where the byte 0x81 right after the text Wrong is purposefully added because it is not a valid cp1252 codepoint:

46 69 72 73 74 0a         |First.|
4c 61 73 74 0a            |Last.|
42 75 66 66 65 72 0a      |Buffer.|
57 72 6f 6e 67 81 0a      |Wrong..|

The third line Buffer is just there to make it clear that I do not read too far. It is a valid line between the last line I read, and the "binary" data.

Here is the output showing that I only ever read two lines, but perl still emits a warning:

user@host$ perl linebug.pl
cp1252 "\x81" does not map to Unicode at ./linebug.pl line 6.
First
Last
user@host$

As can be seen, my program reads and prints the first two lines, and then exits. It should never try to read and interpret anything else, but I still get the warning about \x81 not mapping to Unicode.

Questions

  • Why does it warn? I'm not reading the line. A hunch tells me it's trying to read ahead, but why would it try to decode?
  • Is there a workaround, or a better way to handle files where the encoding changes from one section to another?

I still want the warning when reading the initial lines, in case the file is damaged.

pipe
  • Presumably it does encoding conversion when reading from the file into an internal buffer, not when data is actually returned to the program from that buffer. – Shawn Jun 06 '19 at 23:44

2 Answers

3

Files don't have a concept of lines; they are just streams of bytes. To return a line to the program, Perl must request bytes from the OS and work out for itself where the line ends.

Perl could request a single byte at a time from the OS until it has a full line, but that would be very inefficient. There's a lot of overhead involved in making system calls. As such, Perl requests 8 KiB at a time.

Then, the raw data must be decoded before Perl can determine where the line ends, because a raw 0A doesn't necessarily indicate the end of the line. In UTF-16LE, for example, the byte 0A is also the first byte of U+010A.

Similarly to why one doesn't read from a file one byte at a time, asking the decoder to decode just the next character would be inefficient. There is overhead involved every time you start and stop decoding. As such, Perl decodes all the data it reads as it reads it.

So that means that Perl both reads and decodes more than it returns to the program.
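
The effect can be observed without relying on the order of terminal output. The following is a minimal sketch, assuming a temporary file holding the same bytes as the question's testfile; the use of File::Temp and the variable names are illustrative choices, not part of the original post. It collects warnings in an array while reading only the first two lines.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical test data mirroring the hexdump in the question.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode($tmp);
print {$tmp} "First\nLast\nBuffer\nWrong\x81\n";
close($tmp) or die "Can't close \"$path\": $!\n";

my @warnings;
my @lines;
{
    # Collect warnings instead of printing them.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    open( my $fh, '<:encoding(cp1252)', $path )
        or die "Can't open \"$path\": $!\n";

    while ( defined( my $line = <$fh> ) ) {
        push @lines, $line;
        last if $line =~ /Last/;    # stop before "Buffer" and "Wrong"
    }
    close($fh);
}

printf "lines returned: %d, warnings seen: %d\n",
    scalar(@lines), scalar(@warnings);
print "warning: $_" for @warnings;
```

If the explanation above holds, the warning list is non-empty even though the loop never asked for the line containing 0x81.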


The solution is to treat the file as binary (because it's not really a text file if the encoding changes by section) and do the decoding yourself.

If you're dealing with a single-byte encoding like cp1252, you can continue using readline (aka <$fh>). Because the handle now returns bytes, $/ must hold the encoded form of Line Feed rather than the Code Point itself. As it happens, that's the single byte 0A for cp1252, which the default $/ already matches, so no change is needed.

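use strict;
use warnings;
use Encode qw( decode );

my $qfn = 'testfile';   # file name taken from the question

# Read raw bytes; with no decoding layer, nothing past the header is decoded.
open( my $fh, "<:raw", $qfn )
   or die( "Can't open \"$qfn\": $!\n" );

while( <$fh> ) {
    $_ = decode( 'cp1252', $_ );        # replaces :encoding(cp1252)
    s/\r\n\z/\n/ if $^O eq 'MSWin32';   # replaces :crlf
    print;
    last if /Last/;
}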
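
One caveat: with no CHECK argument, decode substitutes a replacement character for bad bytes without warning, whereas the asker wants to be told about a damaged header. The following is a minimal sketch, assuming the flags Encode::PERLQQ, Encode::WARN_ON_ERR and Encode::LEAVE_SRC are an acceptable stand-in for what the :encoding layer did. The helper name split_header_and_payload and the in-memory sample data are hypothetical. It also shows that the bytes after the keyword can be read from the same handle untouched.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns ( \@header_lines, $binary_payload ).
sub split_header_and_payload {
    my ($bytes) = @_;

    # An in-memory handle stands in for the real file here.
    open( my $fh, '<:raw', \$bytes )
        or die "Can't open in-memory file: $!\n";

    my @header;
    while ( defined( my $raw = <$fh> ) ) {
        # Warn about unmappable bytes, substitute \xHH, leave $raw intact.
        my $line = decode( 'cp1252', $raw,
            Encode::PERLQQ() | Encode::WARN_ON_ERR() | Encode::LEAVE_SRC() );
        push @header, $line;
        last if $line =~ /Last/;
    }

    # Everything after the keyword line is returned as undecoded bytes.
    my $payload = do { local $/; <$fh> };
    return ( \@header, defined $payload ? $payload : '' );
}

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Clean header followed by "binary" data containing 0x81.
my ( $header, $payload ) =
    split_header_and_payload("Caf\xE9\nLast\nBuffer\nWrong\x81\n");
my $clean_warnings = @warnings;

# Damaged header: 0x81 appears before the keyword.
@warnings = ();
my ($bad_header) = split_header_and_payload("Fir\x81st\nLast\nrest");
my $damaged_warnings = @warnings;

printf "clean: %d header lines, %d payload bytes, %d warnings\n",
    scalar(@$header), length($payload), $clean_warnings;
printf "damaged: %d warnings\n", $damaged_warnings;
```

The design choice is that only lines the program actually consumes are ever passed to decode, so a bad byte in the payload cannot trigger a warning, while one in the header still does.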

If you weren't using a single-byte encoding, you might have to switch to using read. (You could keep using readline for UTF-8, because the byte 0A never appears inside a multi-byte UTF-8 sequence.) When using read, the exact solution depends on a few specifics (that pertain to determining how much to read and how much to decode).
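
As an illustration of the read approach, here is a minimal sketch that assumes the header is UTF-16LE; that encoding, the helper name read_utf16le_line and the sample data are assumptions made for the example, not something from the question. It reads one 2-byte code unit at a time with buffered read, so a 0A 00 pair that straddles two characters is never mistaken for a line feed, and it decodes only complete lines.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: read one UTF-16LE line from a :raw handle.
# Returns the decoded line, or undef at end of file.
sub read_utf16le_line {
    my ($fh) = @_;
    my $raw = '';
    while (1) {
        my $got = read( $fh, my $unit, 2 );    # one code unit
        die "Read error: $!\n" if !defined $got;
        last if $got == 0;                      # end of file
        die "Truncated code unit\n" if $got != 2;
        $raw .= $unit;
        last if $unit eq "\x0A\x00";            # aligned Line Feed
    }
    return undef if $raw eq '';
    return decode( 'UTF-16LE', $raw,
        Encode::FB_CROAK() | Encode::LEAVE_SRC() );
}

# Sample header. U+0A05 followed by U+0100 encodes as 05 0A 00 01, which
# contains a misaligned 0A 00; U+010A encodes as 0A 01.
my $text  = "\x{0A05}\x{0100}\x{010A}\nLast\n";
my $bytes = encode( 'UTF-16LE', $text ) . "\x81\x00\xFF";

open( my $fh, '<:raw', \$bytes )
    or die "Can't open in-memory file: $!\n";

my @header;
while ( defined( my $line = read_utf16le_line($fh) ) ) {
    push @header, $line;
    last if $line =~ /Last/;
}

# The rest of the handle is the undecoded binary section.
my $payload = do { local $/; <$fh> };

printf "header lines: %d, payload bytes: %d\n",
    scalar(@header), length($payload);
```

Setting $/ to "\x0A\x00" and using readline would split the first sample line in the wrong place, which is why the sketch reads aligned code units instead.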

ikegami
1

Perl reads from the file in 8 KiB chunks, so way more than a line is read at a time. Data is decoded right as it is read (since the stream must be decoded to find the line endings), so an unexpected encoding is noticed and warned about.

One way to deal with this: use non-buffered reads, via sysread, and read smaller chunks at a time.

Count the characters read; once you run into that spot, back up and continue reading one character at a time, again counting them, to detect the exact place. See this post for a working example of identifying the spot where a warning is fired.

In order to be able to stop there you'll likely want to throw a die out of a $SIG{__WARN__} handler, and have all that code in eval. This will allow you to stop at the place where the warning comes from and have control back.
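
The warning-to-exception pattern can be sketched as follows. This is a minimal sketch under a stated assumption: it applies the handler to a manual decode of raw bytes rather than to a handle with an :encoding layer, and the helper name decode_or_fail is hypothetical.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns ( $text, undef ) on success,
# or ( undef, $message ) if decoding produced a warning.
sub decode_or_fail {
    my ($raw) = @_;
    my $text;
    my $ok = eval {
        # Turn any warning raised inside this block into an exception.
        local $SIG{__WARN__} = sub { die @_ };
        $text = decode( 'cp1252', $raw,
            Encode::PERLQQ() | Encode::WARN_ON_ERR() | Encode::LEAVE_SRC() );
        1;
    };
    return $ok ? ( $text, undef ) : ( undef, $@ );
}

my ( $good, $good_err ) = decode_or_fail("First\n");
my ( $bad,  $bad_err )  = decode_or_fail("Wrong\x81\n");

print "first input decoded cleanly\n" if !defined $good_err;
print "second input failed: $bad_err" if defined $bad_err;
```

For manual decoding, passing Encode::FB_CROAK as the CHECK argument makes decode die by itself, so the handler is mostly useful when the warning comes from code you cannot change.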

As you've read right up to that spot, you can then re-open the file in the encoding suitable for the rest of the file and seek to that spot and read the rest.

I can't write and test all that right now, hopefully this helps.

ikegami
zdim
  • Re "*use non-buffered reads, via sysread, and read smaller chunks at a time.*", This is all wrong. The solution isn't switching to unbuffered; the solution is switching to doing the decoding yourself. You can continue to use buffered I/O (`read` and even `readline` aka `<>`), but decode only what you should be decoding. The chunk size is irrelevant. – ikegami Jun 07 '19 at 08:29
  • @ikegami You mean to read (with `:raw` or silence warnings) and then go through it and (Encode::)decode (quietly)? OK, I've seen that. Will still have some work to do to detect where things change etc. But why is this "all wrong"? Thank you for the edit. – zdim Jun 07 '19 at 08:42
  • Re "*You mean to read (with :raw or silence warnings) and then go through it and (Encode::)decode (quietly)?*", Yes (except for silence warning part). Isn't that what you meant? (After all, you can't use `:encoding` with `sysread`) – ikegami Jun 07 '19 at 08:50
  • Re "*But why is this "all wrong"?*", It's irrelevant whether the read is buffered or not. Switching to unbuffered reads doesn't help; it's switching to doing the decoding yourself that does. – ikegami Jun 07 '19 at 08:58
  • @ikegami With unbuffered reads it doesn't (attempt to) decode further than what I ask it to read, so with smaller reads I can better locate the spot where the encoding changes. (However, now I see that that happens "_after a specific keyword_", which i somehow missed and thought that the code needs to discover that location). I do like a solution that decodes manually as it goes, of course. – zdim Jun 07 '19 at 09:05
  • @ikegami "_can't use `:encoding` with `sysread`_" -- why not? I think that the end of the `sysread` page implies that it's OK? – zdim Jun 07 '19 at 09:07
  • Re "*With unbuffered reads it doesn't (attempt to) decode further than what I ask it to read*", Not quite. With unbuffered reads it doesn't (attempt to) decode further than what I ask it to read because you can't use `:encoding` with unbuffered reads. The solution isn't "*remove `:encoding` and use unbuffered reads*", the solution is just "*remove `:encoding`*". – ikegami Jun 07 '19 at 09:12
  • You definitely can't use it for multi-byte encodings (like UTF-8) because there's no buffer to store partially decoded characters. But it doesn't work for cp1252 either because the length is a length *in bytes*. Try `perl -e'print chr(0x80)' | perl -MData::Dumper -e'binmode(STDIN, ":encoding(cp1252)"); read(\*STDIN, $buf, 1); $Data::Dumper::Useqq=1; print(Dumper($buf))'`, then switch `read` to `sysread`. – ikegami Jun 07 '19 at 09:21
  • The [docs](https://metacpan.org/pod/distribution/perl/pod/perlfunc.pod#sysread-FILEHANDLE,SCALAR,LENGTH,OFFSET) say "*Note that if the filehandle has been marked as `:utf8`, `sysread` will throw an exception. The `:encoding(...)` layer implicitly introduces the `:utf8` layer.*". The exception is new. The fixed docs are new. The fact that it doesn't work is not new. – ikegami Jun 07 '19 at 09:24
  • @ikegami yeah, I'll have to look into that way better (you mean that with `sysread` I can't read encodings other than ASCII?). As for the docs, what it says now to me means that UTF-8 does work, and it _may_ imply that other encodings would (but it doesn't say so, only mentions `:encoding`). If that's wrong then all this is off to start with of course. – zdim Jun 07 '19 at 09:24
  • @ikegami "_The docs say..._" --- ah, that's news to me. I got to 5.26 (now i see) with perldoc. I remember using it (5.10 perhaps) and it worked ... or so it seemed. (I may have missed to notice that it didn't) – zdim Jun 07 '19 at 09:27
  • Re "*you mean that with sysread I can't read encodings other than ASCII?*", With sysread, you can only read bytes. If those bytes represent text encoded using ASCII and you pass them to a sub that expected text using ASCII (or a superset of ASCII such as ISO-8859-* or UTF-8), you're good. – ikegami Jun 07 '19 at 09:27
  • @ikegami "_With sysread, you can only read bytes_" -- um, yes, right of course. I see what you meant – zdim Jun 07 '19 at 09:28
  • [You just failed to notice.](https://pastebin.com/hCT1i8xc) It literally produces a corrupt scalar. – ikegami Jun 07 '19 at 09:29
  • @ikegami Thank you again, I'll have to look into this more carefully. (Regardless of that, as I said, I like the approach to read raw and decode as it goes, +1 for your post) – zdim Jun 07 '19 at 09:30
  • Related reads: [What is the difference between `read` and `sysread`?](https://stackoverflow.com/q/36315124/589924) and [`:encoding(UTF-8)` vs `:encoding(utf8)` vs `:utf8`](https://stackoverflow.com/a/49040165/589924) – ikegami Jun 07 '19 at 09:32
  • @ikegami "_Related reads_" -- oh yes, well aware of those :) Thanks for bringing them up, should go through it again (for example, I apparently never took notice of "_`sysread` doesn't support PerlIO layers_") – zdim Jun 07 '19 at 09:33
  • @ikegami "_Just failed to notice_" -- that `\x80` (really) `does not map to Unicode` (I get that when I change `sysread` to `read` and encoding to UTF-8), so it doesn't show that encoding fails? But I see the point of course. Thank you – zdim Jun 07 '19 at 09:35
  • I don't get `\x80 does not map to Unicode`. Nor should you. 80[cp1252] maps to 20AC[Unicode] (€) – ikegami Jun 07 '19 at 09:37
  • I change encoding in your example to UTF-8 (since that's what I believed should work for sure) – zdim Jun 07 '19 at 09:38
  • I'm not sure I understand your question. You gave bad data, and you got a message saying the decode failed. – ikegami Jun 07 '19 at 09:39
  • @ikegami (I meant that it's not valid unicode anyway so of course it fails to decode when UTF-8 encoding is set) But i should look into this way more, before talking. thank you again – zdim Jun 07 '19 at 09:44