3

I'm trying to convert a string to UTF-8.

#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = "\320\300\304\310\323\321 \316\320\300\312\313";
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";

And in this case I get what I need:

# ./convert.pl
converted:
РАДИУС ОРАКЛ

But if I take the string from an external variable (a command-line argument):

#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = $ARGV[0];
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";

Nothing happens.

# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"
 converted:
\320\300\304\310\323\321 \316\320\300\312\313

This is the dump of the first example:

SV = PV(0x1dceb78) at 0x1ded120
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1de7970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
CUR = 12
LEN = 16

And the second:

SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\\320\\300\\304\\310\\323\\321 \\316\\320\\300\\312\\313"\0
CUR = 45
LEN = 48

I've tried this method:

#!/usr/bin/perl -w
use Devel::Peek;
$str = pack 'C*', map oct, $ARGV[0] =~ /\\(\d{3})/g;
print Dump ($str);

# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"

SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\320\300\304\310\323\321\316\320\300\312\313"\0
CUR = 11
LEN = 48

But again it's not what I need. Could you help me to get the result like in the first script?


After using this

($str = shift) =~ s/\\([0-7]+)/chr oct $1/eg

as suggested by Borodin, I get this

SV = PVMG(0x13fa7f0) at 0x134d0f0
  REFCNT = 1
  FLAGS = (SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x1347970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
  CUR = 12
  LEN = 16
  MAGIC = 0x1358290 
    MG_VIRTUAL = &PL_vtbl_mglob
    MG_TYPE = PERL_MAGIC_regex_global(g)
    MG_LEN = -1
Borodin
Voland Kem
  • What exactly is your input format? Can it contain any special sequences other than backslash + octal number? – melpomene Oct 14 '15 at 03:42
  • When you `pack` you have to escape the escape. The input should be `"\\320\\300\\304\\310\\323\\321 \\316\\320\\300\\312\\313"` Also if you want the space, you need to use the octal space char (040), `"\\320\\300\\304\\310\\323\\321\\040\\316\\320\\300\\312\\313"` – JRD Oct 14 '15 at 04:28
  • You say your third example is *"not what I need"* but the string is identical to the first example which you say is correct except that you have removed the space. What exactly is wrong with it? – Borodin Oct 14 '15 at 05:11
  • Does this work? `($str = shift) =~ s/\\([0-7]+)/chr oct $1/eg` – Borodin Oct 14 '15 at 05:14
  • @melpomene input format is string in utf-8, but this string contains of symbols from [Windows-1251 table](https://en.wikipedia.org/wiki/Windows-1251) – Voland Kem Oct 14 '15 at 07:33
  • @JRD It does not work because it's just a string, not correct symbols in cp1251. At first it must be converted to symbols. – Voland Kem Oct 14 '15 at 07:45
  • @Borodin In the first example LEN=16, in the third LEN=48. The strings are not identical. – Voland Kem Oct 14 '15 at 07:48
  • @Borodin I've tried your method. The result is SV = PVMG(0x13fa7f0) at 0x134d0f0 REFCNT = 1 FLAGS = (SMG,POK,pPOK) IV = 0 NV = 0 PV = 0x1347970 "\320\300\304\310\323\321 \316\320\300\312\313"\0 CUR = 12 LEN = 16 MAGIC = 0x1358290 MG_VIRTUAL = &PL_vtbl_mglob MG_TYPE = PERL_MAGIC_regex_global(g) MG_LEN = -1 If I try to convert it to utf-8 using Encode::from_to($str, 'utf-8', 'windows-1251');, I get converted: ?????? ????? – Voland Kem Oct 14 '15 at 07:48
  • It's ugly, but it solves the problem :D $str = $ARGV[0]; $str = `perl -e 'print "$str"' | iconv -f windows-1251`; print "converted: $str\n"; – Voland Kem Oct 14 '15 at 08:56
  • ***"input format is string in utf-8, but this string contains of symbols from Windows-1251 table"*** Are you saying that someone's typing using backslashes and octal digits to enter a Windows-1251-encoded string on a UTF-8 terminal? – Borodin Oct 14 '15 at 09:10
  • Note that your first example `Encode::from_to($str, 'windows-1251', 'utf-8')` is also wrong, because the result is an *encoded byte string*. It just so happens that those bytes correspond to UTF-8 encoding, which is what your output device is expecting. But if you try to do character operations on your string (for instance `split` or `substr`) then you will find that you're dealing with individual bytes instead of characters. You just need to skip the *reencoding* part, and write `Encode::decode('Windows-1251', $str)` which will leave you with a Perl character string – Borodin Oct 14 '15 at 09:16
  • @Borodin **"Are you saying that someone's typing using backslashes and octal digits to enter a Windows-1251-encoded string on a UTF-8 terminal?"** Yes, you can say so. Freeradius receives these characters from Oracle, which has Cyrillic usernames in cp1251. Thank you for your help :) – Voland Kem Oct 14 '15 at 10:16
  • @VolandKem: So does the data from Oracle contain backslashes, or just CP-1251-encoded characters? – Borodin Oct 14 '15 at 12:26
  • @Borodin Yes, It contains backslashes and octal digits. One more method to convert characters to cp1251: $str = pack "C*", map oct()? oct : 32, $ARGV[0] =~ / \d{3} | \s /gx; – Voland Kem Oct 14 '15 at 14:55

3 Answers

6

It's not clear exactly what input you're getting or where it comes from, or what you want your output to be, but you shouldn't be encoding your data into UTF-8 for use within the program, because you want to deal with characters and not encoded bytes. You should just decode it from whatever external encoding is being sent to the program and work with it like that.
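To illustrate the difference between encoded bytes and characters, here is a minimal sketch using the string from the question. The variable names are only for illustration; `from_to` and `decode` are the `Encode` functions already mentioned in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode from_to);

# Windows-1251 bytes for the string from the question.
my $bytes = "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB";

# Re-encoding in place leaves a UTF-8 *byte* string.
my $reencoded = $bytes;
from_to($reencoded, 'Windows-1251', 'UTF-8');

# Decoding yields a Perl *character* string.
my $chars = decode('Windows-1251', $bytes);

my $byte_len = length $reencoded;   # counts bytes
my $char_len = length $chars;       # counts characters

print "re-encoded length: $byte_len\n";   # 23 (11 two-byte letters + 1 space)
print "decoded length:    $char_len\n";   # 12
```

With the re-encoded version, `substr` or `split` would cut through the middle of a letter; with the decoded version they operate on whole characters.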

It sounds like the input is Windows-1251 and the output is UTF-8 (?), and I assume the backslashes are a distraction. There are no backslashes in the file or typed on the keyboard, are there? So, changing the base to hex for clarity, your input string is like this

"\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB"

and you want to convert it to a Perl character string, do some stuff with it, and print it to the output. If you're on a Linux machine and you want to explicitly decode it from raw input bytes, then you need to write something like this

use utf8;
use strict;
use warnings;
use feature 'say';

use open qw/ :std OUT :encoding(UTF-8) /;
use Encode qw/ decode /;

my $str = "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB";

$str = decode('Windows-1251', $str);

say $str;

output

РАДИУС ОРАКЛ

But that's a contrived situation. The string is actually coming from an input stream, so it's better to set the encoding of the stream and forget about manual decoding. You can use binmode if you're reading from STDIN, like this

binmode STDIN, ':encoding(Windows-1251)';

and then text input from STDIN will be converted implicitly from Windows-1251-encoded bytes to a character string. Alternatively, if you're opening a file on your own handle, you can put the encoding in the open call

open my $fh, '<:encoding(Windows-1251)', $file or die $!;

and then you don't need to add a binmode either.
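As a rough sketch of the layer approach, the following writes the Windows-1251 bytes to a temporary file (only a stand-in for the real input source, which isn't known here) and reads them back through an `:encoding(Windows-1251)` layer, so no manual `decode` call is needed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write raw Windows-1251 bytes to a temporary file (stand-in for real input).
my ($out, $file) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB\n";
close $out or die "close failed: $!";

# Reading through the encoding layer decodes bytes to characters implicitly.
open my $fh, '<:encoding(Windows-1251)', $file or die "open failed: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Encode to UTF-8 only at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
print "$line\n";
```

On a UTF-8 terminal this should print the same text as the first script in the question.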

As I said, I've assumed your output is UTF-8, and in the program above the line

use open qw/ :std OUT :encoding(UTF-8) /;

sets all output file handles to have a default of UTF-8 encoding. The :std also sets the built-in handles STDOUT and STDERR to UTF-8. If this isn't what you want and you can't figure out how to set it up as you need it, then please do ask.

Borodin
0

Think about this:

$ perl -le 'print length("\320\300\304\310\323\321 \316\320\300\312\313")'
12

$ perl -le 'print length($ARGV[0])' "\320\300\304\310\323\321 \316\320\300\312\313"
45

Here we get the number of characters in the given string. Note that when the string literal is inside the Perl script, Perl interprets each backslash escape as a character code. But when the same text is passed on the command line, it is just ordinary shell text: the shell does not interpret the octal escapes, so Perl receives the backslashes and digits literally, exactly as you typed them.
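The same difference can be reproduced inside a single script, as a small sketch: a single-quoted literal keeps the backslashes just like a command-line argument does, while a double-quoted literal has its escapes interpreted. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Double quotes: Perl turns each \NNN escape into a single byte.
my $interpreted = "\320\300\304\310\323\321 \316\320\300\312\313";

# Single quotes: backslashes and digits stay literal, which is
# what the script sees in $ARGV[0].
my $literal = '\320\300\304\310\323\321 \316\320\300\312\313';

print length($interpreted), "\n";   # 12
print length($literal), "\n";       # 45
```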

0

A couple of simple methods to convert backslashes and octal digits typed in a UTF-8 terminal to cp1251:

# Backticks run the command; Perl interpolates $ARGV[0] into it first.
$str = `perl -e 'print "$ARGV[0]"' | iconv -f windows-1251`;
print $str;

or

$str = pack "C*", map oct()? oct : 32, $ARGV[0] =~ / \d{3} | \s /gx;
print $str;
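A possible variant without the shell call, sketched here under the assumption that every escape is a backslash followed by exactly three octal digits: it combines the substitution suggested in the comments with `Encode::decode`. The helper name `unescape_cp1251` is hypothetical, and the default input is just the string from the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn text containing literal \NNN octal escapes
# (Windows-1251 byte values) into a Perl character string.
sub unescape_cp1251 {
    my ($escaped) = @_;
    # Replace each backslash + three octal digits with the byte it names.
    (my $bytes = $escaped) =~ s/\\([0-7]{3})/chr oct $1/ge;
    # Decode the Windows-1251 bytes into characters.
    return decode('Windows-1251', $bytes);
}

# Use the command-line argument if given, else the question's example.
my $input = @ARGV ? $ARGV[0] : '\320\300\304\310\323\321 \316\320\300\312\313';

# Encode to UTF-8 only when printing.
binmode STDOUT, ':encoding(UTF-8)';
print unescape_cp1251($input), "\n";
```

Unlike the `pack` version, this keeps any characters that are not part of an escape (such as the space) unchanged, and it avoids passing user input through a shell command.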
Voland Kem