Perl: Convert (high) decimal NCR to UTF-8

Question

I have this string (decimal NCRs): &#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;

It represents the Japanese text 日本の鍼灸とは.

But I need (UTF-8): %E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF

For the first character: &#26085; ⇒ 日 ⇒ %E6%97%A5

This site does it, but how do I get this in Perl? (If possible in a single regex like s/\&\#([0-9]+);/uc('%'.unpack("H2", pack("c", $1)))/eg;.)

http://www.endmemo.com/unicode/unicodeconverter.php

I also need to convert it back again, from UTF-8 to decimal NCRs.

I've been racking my brain over this one for half a day now; any help is greatly appreciated!

ikegami · Answer 1 · 2015-03-19T13:53:37.537

3

What you call "UTF-8" is actually URL-encoding.

HTML entities (&#26085;) ⇒ text (日) ⇒ URI component (%E6%97%A5):

use HTML::Entities qw( decode_entities );
use URI::Escape    qw( uri_escape_utf8 );

my $text = decode_entities($html);
my $uri_component = uri_escape_utf8($text);

URI component (%E6%97%A5) ⇒ text (日) ⇒ HTML entities (&#26085;):

use Encode         qw( decode_utf8 );
use HTML::Entities qw( encode_entities );
use URI::Escape    qw( uri_unescape );

my $text = decode_utf8(uri_unescape($uri_component));
my $html = encode_entities($text);
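
As far as I know, encode_entities emits named or hexadecimal references (for example &#x65E5;) for characters like these, not decimal ones. If decimal NCRs are required, a minimal sketch is to replace the last step with a substitution over the non-ASCII characters. The helper name uri_component_to_decimal_ncr is hypothetical, and unlike encode_entities it leaves ASCII characters such as & and < untouched:

```perl
use strict;
use warnings;
use Encode      qw( decode );
use URI::Escape qw( uri_unescape );

# Hypothetical helper (not part of any module):
# percent-encoded UTF-8 => text => decimal NCRs.
sub uri_component_to_decimal_ncr {
    my ($uri_component) = @_;

    # %XX => bytes, then UTF-8 bytes => characters.
    my $text = decode('UTF-8', uri_unescape($uri_component));

    # Replace every non-ASCII character with &#NNNNN; (decimal code point).
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;

    return $text;
}

print uri_component_to_decimal_ncr('%E6%97%A5%E6%9C%AC'), "\n";   # &#26085;&#26412;
```

If the result is going into HTML and may contain markup characters, those would still need escaping separately.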

edited Mar 19 '15 at 13:53

answered Mar 19 '15 at 13:28

ikegami


With `#!/usr/bin/perl use strict; use warnings; use HTML::Entities qw( encode_entities ); use URI::Escape qw( uri_escape_utf8 ); my $html = '&#26085;'; my $text = decode_entities($html); my $uri_component = uri_escape_utf8($text); print $uri_component."\n";` I get `panic: utf16_to_utf8: odd bytelen 53 at jp.pl line 12.` – Eesger Mar 19 '15 at 13:53

I think that's because your Perl source file is badly encoded using UTF-16. Note that `use HTML::Entities qw( encode_entities );` should be `use HTML::Entities qw( decode_entities );`. – ikegami Mar 19 '15 at 13:59
You are right (UTF-16). Your first part is very good, but your second part results in `日本の鍼灸とは` and not in `&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;` – Eesger Mar 19 '15 at 14:04

choroba · Accepted Answer · 2015-03-19T13:39:15.677

0

#!/usr/bin/perl
use strict;
use warnings;

use Test::More tests => 2;
use Encode qw{ encode decode };

my $in = '&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;'; # 日本の鍼灸とは
my $out = '%E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF';

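# Decimal NCRs => characters.
(my $utf = $in) =~ s/&#(\d+);/chr $1/ge;

# Characters => UTF-8 bytes => %XX for every byte
# (zero-padded, so bytes below 0x10 are also encoded correctly).
my $r = join q(), map { sprintf '%%%02X', ord } split //, encode('utf8', $utf);
is($r, $out);

# %XX => UTF-8 bytes => characters => decimal NCRs.
(my $s = $r) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
$s = decode('utf8', $s);
$s = join q(), map '&#' . ord . ';', split //, $s;
is($s, $in);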
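
The code above percent-encodes every byte, which is fine for this input but would also turn plain ASCII letters into %XX. A minimal sketch of a variant that leaves the RFC 3986 unreserved characters alone, using only the core Encode module; the sub name ncr_to_uri_component is hypothetical:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: decimal NCRs => characters => percent-encoded UTF-8.
sub ncr_to_uri_component {
    my ($in) = @_;

    # &#NNNNN; => the character with that code point.
    (my $text = $in) =~ s/&#(\d+);/chr $1/ge;

    # Characters => UTF-8 bytes.
    my $bytes = encode('UTF-8', $text);

    # Percent-encode every byte except unreserved ones (A-Z a-z 0-9 - . _ ~).
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;

    return $bytes;
}

print ncr_to_uri_component('a b&#26085;'), "\n";   # a%20b%E6%97%A5
```

The unreserved set is the usual choice for a URI component; adjust the character class if some other set of characters must stay literal.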

edited Mar 19 '15 at 13:39

answered Mar 19 '15 at 13:19

choroba


Thank you for your quick response, but I can't test your POC (I get errors with every character encoding for the file I can think of). Could you rewrite it with input ($in) `&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;`? – Eesger Mar 19 '15 at 13:29
@Eesger, is your input `&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;` or `日本の鍼灸とは`? – ikegami Mar 19 '15 at 13:36
@ikegami: It was the latter, now it's the former. – choroba Mar 19 '15 at 13:36
@choroba Great! Almost there: $r is the conversion that gets $in to the result $out. The second part converts back to the characters (and validates), but I need the original value you edited (`&#26085;` etc.). Can you do another update? – Eesger Mar 19 '15 at 13:44
