I have this string (decimal NCRs): &#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;

It represents the Japanese text 日本の鍼灸とは.

But I need (UTF-8): %E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF

For the first character: &#26085; ⇒ %E6%97%A5

This site does it, but how do I get this in Perl? (If possible in a single regex like s/\&\#([0-9]+);/uc('%'.unpack("H2", pack("c", $1)))/eg;.)

http://www.endmemo.com/unicode/unicodeconverter.php

I also need to convert it back again, from UTF-8 to decimal NCRs.

I've been racking my brain over this one for half a day now; any help is greatly appreciated!

ikegami
Eesger

2 Answers

What you call "UTF-8" is actually URL encoding (percent-encoding) of the text's UTF-8 bytes.


HTML entities (&#26085;) ⇒ text (日) ⇒ URI component (%E6%97%A5):

use HTML::Entities qw( decode_entities );
use URI::Escape    qw( uri_escape_utf8 );

my $text = decode_entities($html);
my $uri_component = uri_escape_utf8($text);
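
The question also asked whether this can be done in a single substitution. A minimal sketch of that, assuming only the core Encode module is available, follows; the helper name ncr_to_percent is hypothetical and not part of any module. It converts each decimal NCR to a character, encodes that character as UTF-8, and percent-encodes every resulting byte.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: decimal NCRs => percent-encoded UTF-8.
sub ncr_to_percent {
    my ($html) = @_;
    # For each &#NNN;: code point => character => UTF-8 bytes => %XX per byte.
    $html =~ s{&#([0-9]+);}{
        join '', map { sprintf '%%%02X', $_ } unpack 'C*', encode('UTF-8', chr $1)
    }ge;
    return $html;
}

print ncr_to_percent('&#26085;&#26412;'), "\n";   # %E6%97%A5%E6%9C%AC
```

Unlike uri_escape_utf8, this sketch only touches the NCRs themselves; any other character in the input (spaces, reserved URI characters, named entities) is left as-is, so the module-based version is the safer choice for arbitrary input.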

URI component (%E6%97%A5) ⇒ text (日) ⇒ HTML entities (&#26085;):

use Encode         qw( decode_utf8 );
use HTML::Entities qw( encode_entities );
use URI::Escape    qw( uri_unescape );

my $text = decode_utf8(uri_unescape($uri_component));
my $html = encode_entities($text);
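
Note that encode_entities writes numeric references in hexadecimal form (&#x65E5;), which is equivalent HTML but not the decimal form shown in the question. If decimal NCRs are required, a sketch along the following lines may help. It assumes only the core Encode module; the helper name percent_to_ncr is hypothetical.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: percent-encoded UTF-8 => decimal NCRs.
sub percent_to_ncr {
    my ($uri_component) = @_;
    # %XX => raw bytes.
    (my $bytes = $uri_component) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    # UTF-8 bytes => characters.
    my $text = decode('UTF-8', $bytes);
    # Every non-ASCII character => &#NNN; (ASCII is left as-is, including & and <).
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print percent_to_ncr('%E6%97%A5%E6%9C%AC'), "\n";   # &#26085;&#26412;
```

Because ASCII characters such as & and < pass through unescaped here, the result is only safe to embed in HTML if the input is known not to contain them; otherwise those characters would need to be escaped as well.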
ikegami
  • With: `#!/usr/bin/perl use strict; use warnings; use HTML::Entities qw( encode_entities ); use URI::Escape qw( uri_escape_utf8 ); my $html = '&#26085;'; my $text = decode_entities($html); my $uri_component = uri_escape_utf8($text); print $uri_component."\n";` I get `panic: utf16_to_utf8: odd bytelen 53 at jp.pl line 12.` – Eesger Mar 19 '15 at 13:53
  • I think that's because your Perl source file is badly encoded using UTF-16. Note that `use HTML::Entities qw( encode_entities );` should be `use HTML::Entities qw( decode_entities );`. – ikegami Mar 19 '15 at 13:59
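  • You are right (UTF-16). Your first part works very well, but your second part results in `&#x65E5;&#x672C;&#x306E;&#x937C;&#x7078;&#x3068;&#x306F;` and not in `&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;` – Eesger Mar 19 '15 at 14:04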
#!/usr/bin/perl
use strict;
use warnings;

use Test::More tests => 2;
use Encode qw{ encode decode };

my $in = '&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;'; # 日本の鍼灸とは
my $out = '%E6%97%A5%E6%9C%AC%E3%81%AE%E9%8D%BC%E7%81%B8%E3%81%A8%E3%81%AF';

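# Decimal NCRs => characters.
(my $utf = $in) =~ s/&#(\d+);/chr $1/ge;

# Characters => UTF-8 bytes => %XX per byte (%02X zero-pads bytes below 0x10).
my $r = join q(), map { sprintf '%%%02X', ord } split //, encode('UTF-8', $utf);
is($r, $out);

# %XX => UTF-8 bytes => characters => decimal NCRs.
(my $s = $r) =~ s/%(..)/chr hex $1/ge;
$s = decode('UTF-8', $s);
$s = join q(), map '&#' . ord . ';', split //, $s;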
is($s, $in);
choroba
  • Thank you for your quick response, but I can't test your POC (I get errors with every character encoding I can think of for the file). Could you rewrite it with the input ($in) `&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;`? – Eesger Mar 19 '15 at 13:29
  • @Eesger, Is your input `&#26085;&#26412;&#12398;&#37756;&#28792;&#12392;&#12399;` or `日本の鍼灸とは`? – ikegami Mar 19 '15 at 13:36
  • @ikegami: It was the latter, now it's the former. – choroba Mar 19 '15 at 13:36
  • @choroba Great! Almost there: $r is the conversion that gets $in to the result $out! The second part converts to the characters (and validates), but I need the original value you edited (`&#26085;` etc.). Can you do another update? – Eesger Mar 19 '15 at 13:44