Convert UTF-8 byte stream to Unicode

Question

How can I easily create a mapping from a UTF-8 bytestream to a Unicode codepoint array? To clarify, if for example I have the byte sequence:

c3 a5 76 aa e2 82 ac

The mapping should produce two arrays of the same length; one with UTF-8 byte sequences, and the other with the corresponding Unicode codepoint. Then, the arrays could be printed side-by-side like:

UTF8                UNICODE             
----------------------------------------
C3 A5               000000E5            
76                  00000076            
AA                  0000FFFD            
E2 82 AC            000020AC

Unicode to UTF8 mappings are not one-to-one. For example, if we take `AA` as mapping to the replacement character as per your table, then so do a range of other codes. — Alohci, Jul 19 '15 at 18:21
@Alohci Yes you are right, so I will remove `one-to-one` mapping and replace with `mapping`. Thanks. — Håkon Hægland, Jul 19 '15 at 18:23
This question is confusing because that’s not a valid UTF-8 string. What are you really trying to do? If you just want to read a UTF-8 stream, set its encoding. But I’m sure you know that, so please explain the real goal here. — tchrist, Jul 19 '15 at 20:09
@tchrist For example, if I run `hexdump -C` on a file, it would be nice to be able to display the corresponding Unicode characters. The reason I put in the invalid UTF-8 byte was to indicate that the mapping should also handle invalid UTF-8 in a sensible manner. — Håkon Hægland, Jul 19 '15 at 20:33

ikegami · Accepted Answer · 2015-07-20T09:49:24.013

4

A solution that works with streams:

use constant READ_SIZE => 64*1024;

my $buf = '';
while (1) {
   my $rv = sysread($fh, $buf, READ_SIZE, length($buf));
   die("Read error: $!\n") if !defined($rv);
   last if !$rv;

   while (length($buf)) {
      if ($buf =~ s/
         ^
         ( [\x00-\x7F]
         | [\xC2-\xDF] [\x80-\xBF]
         | \xE0        [\xA0-\xBF] [\x80-\xBF]
         | [\xE1-\xEF] [\x80-\xBF] [\x80-\xBF]
         | \xF0        [\x90-\xBF] [\x80-\xBF] [\x80-\xBF]
         | [\xF1-\xF7] [\x80-\xBF] [\x80-\xBF] [\x80-\xBF]
         )
      //x) {
         # Something valid
         my $utf8 = $1;
         utf8::decode( my $ucp = $utf8 );
         handle($utf8, $ucp);
      }

      elsif ($buf =~ /
         ^
         (?: [\xC2-\xDF]
         |   \xE0            [\xA0-\xBF]?
         |   [\xE1-\xEF]     [\x80-\xBF]?
         |   \xF0        (?: [\x90-\xBF] [\x80-\xBF]? )?
         |   [\xF1-\xF7] (?: [\x80-\xBF] [\x80-\xBF]? )?
         )
         \z
      /x) {
         # Something possibly valid: keep the partial sequence in $buf
         # (match only, no substitution) and wait for the next read.
         last;
      }

      else {
         # Something invalid
         handle(substr($buf, 0, 1, ''), "\x{FFFD}");
      }
   }
}

while (length($buf)) {
   handle(substr($buf, 0, 1, ''), "\x{FFFD}");
}
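
The code above calls `handle($utf8, $ucp)` without defining it. A minimal sketch of such a callback, assuming the goal is the two-column table from the question, might look like the following. The names `format_row` and the body of `handle` are hypothetical and not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: format one table row.
# $utf8 is a string of raw bytes; $ucp is a one-character decoded string.
sub format_row {
    my ($utf8, $ucp) = @_;
    # Two hex digits per byte, separated by spaces.
    my $hex_bytes = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $utf8;
    # Left-justify the bytes in 20 columns, then the code point as 8 hex digits.
    return sprintf '%-20s%08X', $hex_bytes, ord $ucp;
}

# Hypothetical callback matching the handle($utf8, $ucp) calls above.
sub handle {
    my ($utf8, $ucp) = @_;
    print format_row($utf8, $ucp), "\n";
}
```

Keeping the formatting in a separate function makes it possible to collect rows into arrays instead of printing them, which is closer to what the question literally asks for.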

The above only returns U+FFFD for what Encode::decode('UTF-8', $bytes) considers ill-formed. In other words, it only returns U+FFFD when it encounters one of the following:

An unexpected continuation byte.
A start byte not followed by enough continuation bytes.
The first byte of an "overlong" encoding.

Post-decoding checks are still needed to return U+FFFD for what Encode::decode('UTF-8', $bytes) considers otherwise illegal.
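
One possible shape for such a post-decoding check is sketched below. It assumes the checks wanted are the two that the Unicode Standard requires for UTF-8 (no surrogates, nothing above U+10FFFF); whether this matches every case Encode rejects has not been verified here. The name `sanitize_code_point` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical post-decoding filter: returns U+FFFD for code points that
# are not allowed in well-formed UTF-8, otherwise the character unchanged.
sub sanitize_code_point {
    my ($ucp) = @_;
    my $cp = ord $ucp;
    return "\x{FFFD}" if $cp >= 0xD800 && $cp <= 0xDFFF;   # surrogate
    return "\x{FFFD}" if $cp > 0x10FFFF;                    # beyond Unicode
    return $ucp;
}
```

With this in place, the valid branch of the loop could call `handle($utf8, sanitize_code_point($ucp))` instead of passing `$ucp` through directly.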

edited Jul 20 '15 at 09:49

answered Jul 20 '15 at 01:27

Your solution will not work with encoded code points in the following sets: `[U+2000, U+D7FF]` (prefix code unit `[E2, ED]`), `[U+E000, U+FFFF]` (prefix code unit `[EE, EF]`) and `[U+80000, U+10FFFF]` (prefix code unit `[F2, F4]`). – chansen Jul 20 '15 at 07:39
@chansen, Fixed. Testing obviously needed. It still demonstrates the solution. – ikegami Jul 20 '15 at 08:34
Better, but still not good enough! You need to reject encoded surrogates `[U+D800, U+DFFF]`, which are ill-formed UTF-8. `utf8::decode()` decodes Perl's internal format, UTF-X, not UTF-8. – chansen Jul 20 '15 at 09:09
@chansen, 1) They're not ill-formed, (they may or may not be allowed in certain contexts, but they're not ill-formed) 2) Encode doesn't consider those errors either, 3) The internal format is called "utf8", 4) I don't care if I don't support EBCDIC machines. – ikegami Jul 20 '15 at 09:10
Encode's UTF-8 implementation either rejects or replaces (depending on the flags passed) encoded surrogates! `$ perl -MEncode -E 'printf "U+%.4X\n", ord Encode::decode("UTF-8", "\xED\xBF\xBF");'` outputs `U+FFFD` – chansen Jul 20 '15 at 09:14
@chansen, Oh, I used `decode_utf8`, and it's the equivalent to `decode 'utf8'`, not `decode 'UTF-8'`. Anyway, my answer explicitly lists when U+FFFD is returned. If you want to perform post-decoding checks, you're free to do so. It doesn't affect the decoding at all. – ikegami Jul 20 '15 at 09:18
The question is about decoding UTF-8 and encoded surrogates are Ill-formed! The first conformance requirement in the Unicode standard says 'C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.' – chansen Jul 20 '15 at 09:24
The Unicode standard is very clear on the matter, they provide a table which specifies all possible well-formed UTF-8 byte pattern sequences and explicitly says that any UTF-8 byte sequence that doesn't match the patterns in the table is ill-formed! I'm not going to debate this any further, I suggest you read [The Unicode Standard – Core Specification](http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf). – chansen Jul 20 '15 at 09:42
@chansen, Indeed it does. Adjusted my answer to indicate I do what `decode` does rather than what Unicode requires. I suggest you file a bug report with Encode for returning FFFD FFFD for C0 80 but only FFFD for F5 80 80 80 and for EF BF BD. – ikegami Jul 20 '15 at 09:54
Encode's behavior is inconsistent but permitted by the Unicode Standard which says (page 127 in the previously linked document) `Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself.`. If you want an implementation that implements Unicode's Best Practices for Using U+FFFD, you can use my implementation, [Unicode::UTF8](https://metacpan.org/pod/Unicode::UTF8). – chansen Jul 20 '15 at 10:21

score 3 · Answer 2 · answered Jul 20 '15 at 08:10

Encode has an API for incremental decoding, but it's undocumented, so your mileage may vary! It's used by subclasses of Encode::Encoding and by PerlIO::encoding. As with any undocumented API, it is subject to change at any time. There has been an effort to document the API.

#!/usr/bin/perl
use strict;
use warnings;

use Encode qw[STOP_AT_PARTIAL];

my $encoding = Encode::find_encoding('UTF-8');

my @octets = map { pack 'C', hex } qw<C3 A5 76 AA E2 82 AC F0 9F 90 A2>;
my $buffer = '';
while (@octets) {
    my $octets = $buffer . shift @octets;

    printf "--> processing: <%s>\n", 
      join ' ', map { sprintf '%.2X', ord } split //, $octets;

    my $string = $encoding->decode($octets, STOP_AT_PARTIAL);

    $buffer = $octets;

    if (length $buffer) {
        printf "buffered code units: <%s>\n", 
          join ' ', map { sprintf '%.2X', ord } split //, $buffer;
    }

    if (length $string) {
        printf "received code points: <%s>\n",
          join ' ', map { sprintf 'U+%.4X', ord } split //, $string;
    }
}

Output:

--> processing: <C3>
buffered code units: <C3>
--> processing: <C3 A5>
received code points: <U+00E5>
--> processing: <76>
received code points: <U+0076>
--> processing: <AA>
received code points: <U+FFFD>
--> processing: <E2>
buffered code units: <E2>
--> processing: <E2 82>
buffered code units: <E2 82>
--> processing: <E2 82 AC>
received code points: <U+20AC>
--> processing: <F0>
buffered code units: <F0>
--> processing: <F0 9F>
buffered code units: <F0 9F>
--> processing: <F0 9F 90>
buffered code units: <F0 9F 90>
--> processing: <F0 9F 90 A2>
received code points: <U+1F422>

score 0 · Answer 3 · answered Jul 19 '15 at 21:09

Here is a way to do it (the script takes the byte sequence as the first command line argument):

use feature qw(say);
use strict;
use warnings;

use Encode;

my @hex = split " ", shift;
my $bytes = join '', map { chr hex } @hex;
my @abytes;
my @achr;
while (1) {
    my $str = decode( 'UTF-8', $bytes, Encode::FB_QUIET );
    if ( length $str > 0 ) {
        for my $char ( split //, $str ) {
            my $bytes = encode( "UTF-8", $char, Encode::FB_CROAK | Encode::LEAVE_SRC);
            push @abytes, $bytes;
            push @achr, $char;
        }
    }
    last if length $bytes == 0;
    push @abytes, substr $bytes, 0, 1;
    push @achr, chr 0xfffd;
    $bytes = substr $bytes, 1;
}

my $fmt = '%-20s%-20s';
say sprintf $fmt, qw(UTF8 UNICODE);
say "-" x 40;
for my $char ( @achr ) {
    my $bytes = shift @abytes;
    my $str1 = join ' ', map { sprintf '%02X', ord $_} split //, $bytes;
    my $str2 = sprintf '%08X', ord $char;
    say sprintf $fmt, $str1, $str2;
}

You asked for this to work on a stream, but your solution doesn't work for a stream. If you have received a partial character, you can't assume it's invalid as your code does since the next read might complete it. — ikegami, Jul 19 '15 at 22:43
