Why do hash results differ from the ASCII case when I apply utf8 functions?

Question

I have code:

push my( @list ), $x;
utf8::upgrade( $tmp =  $x ); push @list, $tmp;
utf8::downgrade( $tmp = $x ); push @list, $tmp;
push @list, Encode::encode_utf8( $x );
push @list, Encode::decode_utf8( $x );

print Digest::SHA::hmac_sha256_hex( $_ ), "\n" for @list


d9fa76e37bfe94cfcb0011cf070316775e52845021ee92d9bebe8ef289f87e16
d9fa76e37bfe94cfcb0011cf070316775e52845021ee92d9bebe8ef289f87e16
d9fa76e37bfe94cfcb0011cf070316775e52845021ee92d9bebe8ef289f87e16
d9fa76e37bfe94cfcb0011cf070316775e52845021ee92d9bebe8ef289f87e16
d9fa76e37bfe94cfcb0011cf070316775e52845021ee92d9bebe8ef289f87e16

Why, when $x is фыва, is the fourth hash different, and why does the program crash?

09165674df9a2eada20acb972bbf71d4cb5637b152d84568fd2e8fcbe9d61188
09165674df9a2eada20acb972bbf71d4cb5637b152d84568fd2e8fcbe9d61188
09165674df9a2eada20acb972bbf71d4cb5637b152d84568fd2e8fcbe9d61188
36cdc4291ac91e26f76a208feb90e8a5a35729d54660bbb63acdb82746f7ec6a
Wide character in subroutine entry at ./t3.pl line 7.

Please shed some light on the utf8 magic. Thank you.

UPD

In my app I need to verify data integrity by checking signatures. Sometimes the data arrives in UTF-8, a case we did not handle before. Here I am trying to check that the signature does not change after:

Digest::SHA::hmac_sha256_hex( Encode::encode_utf8( $data ) )

In parallel, I am checking what happens if I apply one function or another to the incoming data.

Yes, I do not understand utf8, which is why I am asking.

You are misunderstanding `map` as well as UTF-8 encoding. `map` is a tool for converting one list to another by applying the same rule to every element of the original list. It should not be used for its side effects, such as printing each element of a list, and especially when the returned list is discarded. So `map { print Digest::SHA::hmac_sha256_hex( $_ ), "\n" } @list` should probably be `print Digest::SHA::hmac_sha256_hex( $_ ), "\n" for @list`. — Borodin, Nov 13 '16 at 18:05
Note that `utf8::downgrade()` will fail if argument cannot be represented in ASCII. So it will fail for cyrillic characters. Also `Digest::SHA::hmac_sha256_hex()` requires argument as bytes, so it will fail for wide characters. — Håkon Hægland, Nov 13 '16 at 18:13
@HåkonHægland Strange, but `utf8::downgrade()` does not fail for Cyrillic characters, as you can see. — Eugen Konkov, Nov 13 '16 at 18:49
I can't understand why [***Sinan Ünür***](http://stackoverflow.com/users/100754/sinan-%C3%9Cn%C3%BCr) unilaterally closed this question. It is about `utf8::downgrade`, whereas the post that is supposed to be ***identical*** uses only `Digest::MD5` and `binmode`. There is an underlying commonality, but not one that most people who need an answer to this question would understand. This question should never have been closed. — Borodin, Nov 13 '16 at 18:53
@HåkonHægland: The restriction is that all characters must be representable in *eight bits* in the *current encoding*. Clearly half of that includes ASCII, but characters from 0x80 to 0xFF are valid but dependent on the current locale. — Borodin, Nov 13 '16 at 18:57
@EugenKonkov: This is why people use an encoding (usually UTF-8) of Unicode: eight bits aren't usually sufficient to represent all the characters required for international communication, and no languages that I know provide for individual "Extended ASCII" characters to be associated with a specific encoding. — Borodin, Nov 13 '16 at 19:00

Borodin · Answer 1 · 2016-11-13T18:59:06.653

The most important reason is that you haven't understood what utf8::downgrade does. Take a look at the utf8 utility functions.

If you had

use strict;
use warnings 'all';

in place at the top of your code, you would have seen the message

Wide character in subroutine entry

for the line

utf8::downgrade( $tmp = $x )

The documentation tells us about utf8::downgrade

Converts in-place the internal representation of the string from UTF-8 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC)
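
To illustrate what "internal representation" means in that description, here is a minimal sketch, assuming a string whose code points all fit in eight bits. The sample string `"caf\xE9"` and the variable names are illustrative and not taken from the question. It suggests that upgrading and downgrading change only how perl stores the string, not its value, which would explain why the first three hashes in the question are identical.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Four characters; every code point fits in eight bits (0xE9 is e-acute in Latin-1).
my $native = "caf\xE9";

my $upgraded = $native;
utf8::upgrade($upgraded);        # stored internally as UTF-8, flag set

my $round_trip = $upgraded;
utf8::downgrade($round_trip);    # stored as one byte per character again, flag cleared

# Record the internal flag of each copy before doing anything else with them.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } $native, $upgraded, $round_trip;
print "flags (native, upgraded, round trip): @flags\n";

# The three copies are the same string as far as Perl code is concerned.
print "same string\n" if $native eq $upgraded and $upgraded eq $round_trip;

# A digest sees the same octets either way.
print "same digest\n" if sha256_hex($native) eq sha256_hex($upgraded);
```

The flags are captured before hashing because a function that asks for the byte form of a string may downgrade it as a side effect.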

Your character string starts with ф, which is Unicode U+0444 or CYRILLIC SMALL LETTER EF. There is no equivalent in Latin-1 or EBCDIC, so your code generates an error that you don't handle
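
The failure described above can be reproduced with a short sketch. It assumes `$chars` holds decoded characters (as it would under `use utf8`, or after decoding input); the code points are written as escapes so the example does not depend on the source file's encoding. The variable names are illustrative. The optional second argument to `utf8::downgrade` asks it to return false instead of dying.

```perl
use strict;
use warnings;

# U+0444 U+044B U+0432 U+0430: the characters of the question's sample string.
my $chars = "\x{444}\x{44b}\x{432}\x{430}";

# Default behaviour: downgrade dies when a code point does not fit in one byte.
my $tmp   = $chars;
my $died  = !eval { utf8::downgrade($tmp); 1 };
my $error = $died ? $@ : '';
print "downgrade died: $error" if $died;

# With a true second argument it reports failure by returning false instead.
my $copy = $chars;
my $ok   = utf8::downgrade($copy, 1);
print "downgrade with fail-ok returned ", ($ok ? "true" : "false"), "\n";
print "string left unchanged\n" if $copy eq $chars;
```

Wrapping the call in `eval`, or passing the second argument, is one way to handle the error rather than letting the program stop.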

You don't say what you're trying to do, but it's most likely that you need to use the Encode module, which will convert between most popular character encodings
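
As a rough sketch of the Encode approach, assuming the data is held as decoded characters and must be turned into octets before signing: the key `'hypothetical-secret'` and the variable names are made up for illustration and do not come from the question. The example also shows two pitfalls that may match the output in the question: encoding data that is already encoded changes the octets, and passing decoded characters above 0xFF directly to the digest function dies.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use Digest::SHA qw(hmac_sha256_hex);

my $key   = 'hypothetical-secret';              # assumed key, not from the question
my $chars = "\x{444}\x{44b}\x{432}\x{430}";     # four decoded characters
my $bytes = encode_utf8($chars);                # eight UTF-8 octets

# Digest functions operate on octets, so sign the encoded form.
my $sig = hmac_sha256_hex($bytes, $key);
print "signature of encoded data: $sig\n";

# Encoding a second time produces different octets, hence a different signature.
my $twice = hmac_sha256_hex(encode_utf8($bytes), $key);
print "double-encoded signature differs\n" if $twice ne $sig;

# Decoded characters above 0xFF cannot be passed to the digest directly.
my $error = eval { hmac_sha256_hex(decode_utf8($bytes), $key); 1 } ? '' : $@;
print "characters rejected: $error" if $error;
```

The design choice here is to decode input once at the boundary, keep characters inside the program, and encode exactly once just before signing, so the same data always yields the same octets.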

edited Nov 13 '16 at 18:59

answered Nov 13 '16 at 18:44

Borodin


How does perl know that a string should be characters and not octets? – Eugen Konkov Nov 13 '16 at 19:01
@EugenKonkov: There is an internal flag that says whether each byte in a string should be treated as an individual character or as part of (an extension of) a UTF-8-encoded multi-byte character. You should leave perl to do the right thing internally, and make sure that all inputs and outputs are decoded and encoded correctly. It is extremely rare that you will need to work with individual bytes of a multi-byte encoding. – Borodin Nov 13 '16 at 19:07
ah. it knows because of `utf8::downgrade` but it reports my line of code because of XS. I should try utf8 symbols not from my locale – Eugen Konkov Nov 13 '16 at 19:25
@EugenKonkov: No, that's nonsense. `utf8::downgrade` converts multi-byte UTF-8 characters that represent *eight-bit* character codes to a single byte. As you have seen, it fails if the character code is bigger than eight bits. – Borodin Nov 13 '16 at 22:03
@EugenKonkov: No, that's nonsense. Perl doesn't *"know because of `utf8::downgrade`"*: as I said, it keeps an internal flag that says whether or not each string is multi-byte-encoded. `utf8::downgrade` converts multi-byte UTF-8 characters that represent eight-bit character codes to a single byte, and also clears that flag. But, as you have seen, it fails if the character code is bigger than eight bits. You are using characters that are wider than eight bits, so `utf8::downgrade` will do nothing but report the error that you are seeing. – Borodin Nov 13 '16 at 22:15
@EugenKonkov: Please use `Encode` in preference to `utf8`. The only proper use of `use utf8` is to tell the perl compiler that the current source file is UTF-8-encoded. – Borodin Nov 13 '16 at 22:17
