What is the standard test in Perl to determine if a value is a sequence of bytes or an encoded string of characters? And if it's an encoded string, what character encoding is it in?

Let's assume the following complete Perl script:

'foo';

How would one determine if this literal string is a sequence of bytes or a string of characters in some encoding? And if it's a string of characters in some character encoding, what character encoding is it in?

This question is not about Unicode or UTF-8. It's about bytes versus characters in Perl generally. This question is also not about automated character encoding detection, which is a different topic entirely.

UPDATE

After initializing $letter, I want Perl to tell me what character encoding it thinks the letter stored in the variable $letter is in. I don't expect it necessarily to be right. Ensuring that Perl's understanding of what character encoding the letter is in is my responsibility as the programmer. I get that. But there should be a simple, easy way to test what character encoding Perl thinks a character (or string of characters) is in. Isn't there?

C:\>perl -E "$letter = 'Ž'; say $letter =~ m/\w/ ? 'matches' : 'does not match'"
does not match

C:\>perl -MEncode -E "$letter = decode('UTF-8', 'Ž'); say $letter =~ m/\w/ ? 'matches' : 'does not match'"
does not match

C:\>perl -MEncode -E "$letter = decode('Windows-1252', 'Ž'); say $letter =~ m/\w/ ? 'matches' : 'does not match'"
matches

C:\>perl -MEncode -E "$letter = decode('Windows-1252', 'Ž'); $letter = encode('Windows-1252', $letter); say $letter =~ m/\w/ ? 'matches' : 'does not match'"
does not match

C:\>chcp
Active code page: 1252

C:\>

Can't Perl report on demand what character encoding it understands (rightly or wrongly) the value stored in $letter is in?

Jim Monty
  • Please read http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129 – innaM Jul 08 '13 at 08:48
  • @innaM You may be interested to read my [recent post](http://www.perlmonks.org/?node_id=1042970) on PerlMonks about this same Stack Overflow question and its many fine answers. I've read it many times. – Jim Monty Jul 08 '13 at 17:25

5 Answers

Unlike some other programming languages, such as Python, Perl does not make a distinction between "byte strings" and "Unicode strings". All strings have Unicode semantics, as well as byte semantics.

That being said, there is a purely internal distinction made between strings which contain ASCII, ISO-8859-1, or binary data, and strings which contain Unicode data. This distinction is made using the UTF8 flag, which can be checked using the utf8::is_utf8() function. However, keep in mind that this flag is set and cleared automatically. For instance, appending a non-ISO-8859-1 character to a string will reencode any data in the string as UTF-8, if necessary, and set the UTF8 flag. This conversion is invisible to pure-Perl programs, though, so you should rarely need to look at it.
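A minimal sketch of the automatic flag handling described above. The variable names are illustrative only, and the example uses just the core `utf8::is_utf8()` and `utf8::upgrade()` functions, which are available without loading any module. It is meant to show that the flag describes storage, not the value of the string.

```perl
use strict;
use warnings;

# A string whose code points all fit in one byte; the UTF8 flag starts off.
my $s = "caf\xE9";
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

# Appending a code point above 0xFF forces the internal upgrade.
my $t = $s . "\x{263A}";
my $flag_after = utf8::is_utf8($t) ? 1 : 0;

# Upgrading a copy by hand changes the storage, not the value.
my $u = $s;
utf8::upgrade($u);
my $same_value = ($u eq $s && length($u) == length($s)) ? 1 : 0;

print "flag before: $flag_before, after: $flag_after, same value: $same_value\n";
```

Because the upgraded copy still compares equal to the original, the flag cannot serve as a "this string has been decoded" marker.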

If you have a non-Unicode string (e.g., binary data) and you need to figure out what encoding it is in, see "How can I guess the encoding of a string in Perl?".

Community
  • I know about `utf8::is_utf8()` and `Encode::is_utf8()`, and I know they are functions that report the state of a **purely internal** flag. My question isn't related to any aspect of Perl's internal representation of strings. I'm asking very specifically what the standard test is for determining whether Perl will use character semantics or byte semantics for any given string. You say "Perl does not make a distinction between 'byte strings' and 'Unicode strings,'" but this isn't the case at all, else what are `Encode::decode()` and `Encode::encode()` for? – Jim Monty Jul 08 '13 at 03:01
  • The documentation of [`Encode::find_encoding`](http://search.cpan.org/~dankogai/Encode/Encode.pm#find_encoding) doesn't suggest there's any way to use it to determine what encoding a non-Unicode string (i.e., binary data) is in. Given the name of an encoding, it returns an _object_ corresponding to the encoding with that name. (I can't figure out for what purpose a Perl programmer would use this function.) – Jim Monty Jul 08 '13 at 03:22
  • @Jim Monty, there's no such thing as "byte semantics" and "character semantics". Perl functions that deal with text (e.g. `uc`, `m//`) always expect strings of Unicode code points. There are bugs left in for historical reasons that cause `\s` to sometimes match NBSP and sometimes not. (Similar for `\w` and letters in U+0080..U+00FF.) This is based on the result of `is_utf8`. – ikegami Jul 08 '13 at 04:08
  • @ikegami, What encoding does a string have when returned by Cwd::getcwd or any other functions depending on the default encoding of operating system and, especially, Windows OS? And what encoding does Perl think of in such cases? Then, why does a concatenation of these strings with a literal string in UTF8 return a string in corrupted encoding? – Aleksey F. Feb 06 '21 at 12:54
  • @Aleksey F., Windows: The encoding returned by `"cp".Win32::GetACP()`. See Win32::LongPath for versions that return decoded text not limited to any code page. /// Unix: File names are arbitrary sequence of bytes which could have any encoding, or might not be text at all. – ikegami Feb 06 '21 at 15:01
  • @ikegami, thanks. And why does Perl ignore this encoding when concatenates the returned directory path with another string in UTF8? This leads `File::Spec::Functions` to corrupted paths and then to the error `file was not found`. This happens if someone does not handle the encoding of any paths. The worst thing is the need of reopening `:std` in the proper encoding (using either `Win32::GetACP` or `Win32::GetConsoleOutputCP` or else), otherwise stdoutput displays strings in corrupted encoding. The described behavior proves that Perl's expectation on strings of Unicode code points is wrong. – Aleksey F. Feb 18 '21 at 17:21
  • Re "*And why does Perl ignore this encoding when concatenates the returned directory path with another string in UTF8?*", F::S::F concatenates what you provide. If it's concatenating incompatible things, it's because you provided incompatible things. – ikegami Feb 18 '21 at 19:26

There is no file that isn't encoded. Unless told otherwise, Perl assumes that a source file is in a single-byte encoding (effectively Latin-1), so there is a 1:1 mapping between characters and octets. This means that in a file saved with UTF-8 encoding,

length("ø") == 2 and
"ø" eq "\xc3\xb8" and
"ø" ne "\N{LATIN SMALL LETTER O WITH STROKE}"

all of which are not true under use utf8.

In Perl, every string effectively is a sequence of codepoints. Without any decoding steps in the way, every octet will be seen as one codepoint, as demonstrated above. This holds for both string literals in your source file, and IO operations without PerlIO layers.
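The I/O half of that statement can be sketched with in-memory filehandles. This is only an illustration under the assumption that an in-memory file behaves like a file on disk for this purpose; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $char = "\x{F8}";    # LATIN SMALL LETTER O WITH STROKE, one code point

# Write through an encoding layer into an in-memory file.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$out} $char;
close $out or die "close: $!";

# Read back without a layer: each octet arrives as one code point.
open my $raw_in, '<', \$buffer or die "open: $!";
my $raw = do { local $/; <$raw_in> };
close $raw_in;

# Read back through the layer: the octets are decoded into one code point.
open my $dec_in, '<:encoding(UTF-8)', \$buffer or die "open: $!";
my $decoded = do { local $/; <$dec_in> };
close $dec_in;

printf "raw length %d, decoded length %d\n", length($raw), length($decoded);
```

Both `$raw` and `$decoded` are ordinary strings. Nothing attached to either of them records which layer, if any, produced it.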


De- and Encode

The encode function takes a string of codepoints and encodes them with a specified encoding. E.g.

use utf8;
use Test::More; use Encode;

# "is" tests for string equality, "isnt" is the negation

my $str = "ø";
isnt $str, "\xc3\xb8", "String is unencoded";
is length($str), 1,    "Unencoded char has length 1";

my $encoded = encode "UTF-8", $str;
is $encoded, "\xc3\xb8", "The string is properly encoded";
is length($encoded), 2,  "Encoding may map a codepoint to multiple bytes";

This emits a string of bytes, which are represented as codepoints in the range 0x00–0xFF. The encoded string doesn't have an encoding that could be queried; you, the programmer, have to know it. Because it is just a normal string, we could encode it again:

my $double_encoded = encode "UTF-8", $encoded;
is $double_encoded, "\xc3\x83\xc2\xb8", "Double encoding works without type error";

The decode function takes a string of codepoints in the byte range (aka byte string) and transforms it according to the rules of the respective encoding. So:

is decode("UTF-8", $double_encoded), $encoded, "Decoding works";
is decode("UTF-8", $encoded),        $str,     "Decoding works 2";

It reverses the encoding step, thereby possibly mapping multiple byte-ranged characters to a single codepoint.

done_testing;
amon
  • Understood. So what is the standard _test_ in Perl to determine whether a value has byte semantics or character semantics, and if it has character semantics, what character encoding is it in? (I'm looking for a function.) – Jim Monty Jul 07 '13 at 23:57
  • There is no such thing: every string is considered a sequence of code points. It just happens that you can treat it as a byte string if all CPs are ≤ 0xFF. So `/[^\x00-\xFF]/` might be a start. If you need to know if a string is a bytestring or stringstring, your IO formats might be dubious. – amon Jul 08 '13 at 00:01
  • In Perl, _no_ string is considered a sequence of code points unless and until the programmer does something explicit to ensure that Perl treats the string as code points (i.e., characters) rather than as bytes. For example, a programmer must use a specific I/O layer (e.g., `:encoding(Windows-1252)`) or `Encode::decode()` or some other mechanism. My question is, how do you test the state of Perl's current _understanding_ of a string? Is there not a built-in function to know whether a string has been decoded yet or not? – Jim Monty Jul 08 '13 at 03:11
  • @Jim Monty, That's not true. In Perl, every string is considered a sequence of code points if it's passed to a `uc`, `m//` or another function that deals with text. It's your job to make it so. String functions (e.g. `substr`, `ord`) don't assign any meaning whatsoever. – ikegami Jul 08 '13 at 04:09
  • @JimMonty I added a de- and encode example to my answer. Again, there is no semantic difference between bytestrings and strings, all strings are just naked codepoints in all their unencoded glory. Encoding layers specify a certain string transformation. Some *treat* the input as byte strings and emit a string containing higher code points. – amon Jul 08 '13 at 07:17

It's about bytes versus characters in Perl generally.

That makes no sense. Each element of a string is a character by definition, so it's definitely a string of characters.

The characters can also be bytes (8-bit values). It's not an either-or thing.

How would one determine if this literal string is a sequence of bytes or a string of characters in some encoding?

You have a string consisting of the characters 66, 6F and 6F. How is Perl supposed to know what those values represent? Are they Unicode code points? Are they HTML encoded using UTF-8? Are they configuration files using UTF-8? Are they temperature sensor measurements? It has no way of knowing. They are simply three values.
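As a small illustration of "simply three values", the sketch below lists the elements of the string with `ord` and then applies the only structural check available, namely whether every element fits in a byte. The variable names are chosen for the example only.

```perl
use strict;
use warnings;

my $str = 'foo';

# Each element of a string is just a number; ord() exposes it.
my @values = map { sprintf('%02X', ord($_)) } split //, $str;
print "@values\n";

# The only structural test available: does every element fit in a byte?
my $fits_in_bytes = $str      =~ /^[\x00-\xFF]*\z/ ? 1 : 0;
my $wide_fits     = "\x{17D}" =~ /^[\x00-\xFF]*\z/ ? 1 : 0;
print "fits in bytes: $fits_in_bytes, U+017D fits: $wide_fits\n";
```

A positive result only says the string could be a byte string. It says nothing about whether the values are meant as bytes, code points, or something else.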

ikegami
  • As I explained in my original post, I'm not asking a question about character encoding detection. I simply want to know what the standard test is for determining if something is bytes or characters in Perl's view. And Perl very _definitely_ has a view. – Jim Monty Jul 08 '13 at 03:50
  • And I explained that's impossible. Perl has no way to know whether 66 is a byte or not (unless you want `/^[\x00-\xFF]*\z/`). And I've added a bit that says it's always going to be a character by definition. – ikegami Jul 08 '13 at 03:52
  • Re "Perl very definitely has a view.", No, Perl assigns no semantics to the elements of strings. – ikegami Jul 08 '13 at 03:54
  • Please read [Byte and Character Semantics](http://perldoc.perl.org/perlunicode.html#Byte-and-Character-Semantics) in [`perlunicode`](http://perldoc.perl.org/perlunicode.html) for a better understanding of Perl's byte and character semantic model. – Jim Monty Jul 08 '13 at 04:15
  • I beg to differ. You should read the replies from amon and I for a better understanding. I'm sure you didn't come here to get the docs quoted to you. – ikegami Jul 08 '13 at 04:19

"Ž" in cp1252 is 8E, so what you perceive as 'Ž' is the same as chr(0x8E).

Keeping that and the following in mind,

decode('UTF-8', chr(0x8E))     ===   chr(0xFFFD)  [Invalid UTF-8]
decode('cp1252', chr(0x8E))    ===   chr(0x17D)
encode('cp1252', chr(0x17D))   ===   chr(0x8E)
  1. Your first snippet passes 0x8E to the match operator. U+008E (SINGLE SHIFT TWO) is not a "word" code point.

    What you are seeing is the effect of passing something other than Unicode code points (cp1252-encoded text) to an operator expecting Unicode code points.

  2. Your second snippet passes 0xFFFD to the match operator. U+FFFD (REPLACEMENT CHARACTER) is not a "word" code point.

    What you are seeing is the effect of passing something other than UTF-8-encoded text (cp1252-encoded text) to a function expecting UTF-8.

  3. Your third snippet passes 0x017D to the match operator. U+017D (LATIN CAPITAL LETTER Z WITH CARON) IS a "word" code point.

  4. Your fourth snippet, like your first snippet, passes 0x8E to the match operator.

    What you are seeing is the effect of passing something other than Unicode code points (cp1252-encoded text) to an operator expecting Unicode code points.

Your update actually demonstrates what previous answers have already told you: The match operator always considers the string to be a string of code points. There's nothing to check, because the behaviour is always the same.

(The passage about "semantics" has no bearing on your update. Correct behaviour is always obtained because -E enables the unicode_strings feature.)
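The four cases above can be reproduced without depending on the console code page by starting from `chr(0x8E)` directly. This is a sketch, assuming the standard Encode module; the `%got` hash and its labels are made up for the example.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Encode qw(decode encode);

# chr(0x8E) stands in for what 'Ž' is when the input is cp1252-encoded.
my %got = (
    'as is'         => chr(0x8E),
    'decode UTF-8'  => decode('UTF-8',  chr(0x8E)),
    'decode cp1252' => decode('cp1252', chr(0x8E)),
    'round trip'    => encode('cp1252', decode('cp1252', chr(0x8E))),
);

# Report the single code point in each result and whether \w matches it.
for my $label (sort keys %got) {
    my $s = $got{$label};
    printf "%-14s U+%04X %s\n", $label, ord($s),
        $s =~ /\w/ ? 'matches' : 'does not match';
}
```

In every case the match operator sees a plain number and classifies it as a Unicode code point; only the number differs.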

ikegami

Perl lacks a simple way to know what character encoding a string of characters is presumed to be in. It has an internal flag that can be probed to determine if its own internal representation of the string is UTF-8 or not, but this is entirely different from a test to determine the character encoding of a string of characters.

Let us imagine a notional built-in function named encoding(). Here's what it would do:

C:\>perl -E "say encoding 'quick brown fox'"
ISO-8859-1

C:\>perl -E "use utf8; say encoding 'quick brown fox'"
UTF-8

C:\>perl -E "use utf8; say encoding 'γρήγορη καφέ αλεπού'"
UTF-8

C:\>perl -Mutf8 -MEncode -E "say encoding decode('ISO-8859-7', 'γρήγορη καφέ αλεπού')"
ISO-8859-7

C:\>

(The default character encoding is ISO-8859-1, which is also known as Latin 1.)

This really isn't as difficult a question and answer as others have made it seem, which is exactly the point of it. If Perl had a built-in function to report the character encoding assigned to a string of characters, it would serve to make understanding, discussing, and coping with different character encodings a lot easier.

Jim Monty
  • Having strings in different encodings is not useful at all. (e.g. How would you concat two strings?) You should always normalize your inputs and always re-encode your outputs. This is true in all languages. In Perl, this is done by decoding and encoding. – ikegami Jul 08 '13 at 05:17
  • There's nothing in my question or answer about "having strings in different encodings." I know you "should always decode your inputs and always encode your outputs." Everyone who understands this stuff knows that. My question had nothing to do with decoding and encoding. I asked an exceedingly simple question for which there is a trivial answer: Perl lacks a function to report the understood encoding of a string. And it's too bad, because it would be very helpful. – Jim Monty Jul 08 '13 at 05:20
  • `encoding()` should always return the same value. Strings would not be useful at all otherwise. (e.g. How would you concat two strings?) You should always normalize your inputs and always re-encode your outputs. This is true in all languages. In Perl, this is done by decoding and encoding. – ikegami Jul 08 '13 at 05:23
  • You really should spend two seconds listening instead of nitpicking. No, it's not useful to tag strings with their encoding so Perl can "understand" (report) it to you. There's a reason no language does this. – ikegami Jul 08 '13 at 05:25
  • I'm not nitpicking. You insist strings in Perl don't get tagged with an encoding, but they most assuredly and obviously _do_! What else is decoding doing besides tagging a value with an encoding? – Jim Monty Jul 08 '13 at 05:28
  • It transforms the string so that 8E becomes 17D. It's a simple numerical mapping. Nothing is tagged. // It's no different than `$x **= 2;`. The value is transformed/mapped to another, but no tag ("square") is attached to it. – ikegami Jul 08 '13 at 05:35
  • You're right. Neither question nor answer are difficult. But both are wrong because both make a totally incorrect assumption. – innaM Jul 08 '13 at 08:45