1

I want to substitute a substring with a hash. The substring contains non-ASCII characters, so I tried to encode it to UTF-8.

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)', lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4).encode()).hexdigest(), line.encode('utf-8'))

I am not really sure why this doesn't work. I thought that with line.encode('utf-8') the whole string would be encoded. I also tried encoding the individual m.group(...) results to UTF-8, but I got the same UnicodeDecodeError.

[unicodedecodeerror: 'ascii' codec can't decode byte in position ordinal not in range(128)]

Sample input:

Start: myUsername: myÜsername:

What am I missing?

EDIT:

Traceback (most recent call last):
  File "C:/Users/Peter/Desktop/coding/filter.py", line 26, in <module>
    encodeline = line.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 112: ordinal not in range(128)
peacemaker
  • Could you please post the example input and the stacktrace of the error you mention? (generally, your question does not seem to be [MCVE](https://stackoverflow.com/help/mcve)). – sophros Oct 26 '18 at 12:33
  • You are right, sorry. I have added some more information. – peacemaker Oct 26 '18 at 12:37
  • Is this Python 2 or Python 3 code? I strongly suspect your problem is that you're running on Python 2, and trying to `encode` a `str` (which is a largely nonsensical thing to do). A full traceback and a [MCVE] would be helpful. Lastly, to be sure, split up the line so you only `encode` once per line, e.g. `encodedline = line.encode('utf-8')`, then replace `line.encode('utf-8')` in the `re.sub` with `encodedline` so you aren't able to confuse which `encode` is the problem. – ShadowRanger Oct 26 '18 at 13:00
  • I am running python 2.7 - is there a way to solve this problem or should I go with the "hack" ? – peacemaker Oct 26 '18 at 13:02
  • @peacemaker: The hack is a bad idea (`setdefaultencoding` is deleted from `sys` after calling it for a reason; changing the default mid-run risks all sorts of problems from various libraries that may have cached the encoding, or the results of encoding things in it, and suddenly find that things aren't behaving the way they did at startup). I strongly suspect your code will work by deleting all calls to `encode` in that line; you already *had* UTF-8 encoded data, so trying to `encode` it again was the source of your problems. See [my answer](https://stackoverflow.com/a/53009825/364696). – ShadowRanger Oct 26 '18 at 13:29

3 Answers

1

Based on your symptoms, you're running on Python 2. Calling encode on a Python 2 str is almost always nonsensical.

You have two problems; one you're hitting now, and one you'll hit if you fix your current code.

Your first problem is that line is already a str holding (apparently) UTF-8 encoded bytes, not a unicode object, so encoding it implicitly decodes it with Python's default encoding (ASCII; this isn't locale specific to my knowledge, and it's a rare Python 2 install that uses anything else), then re-encodes with the specified codec (or the default if none is specified). Basically, line was already UTF-8 encoded and you told Python to encode it again as UTF-8. That's nonsensical, so Python tried to decode as ASCII first, and failed before it even tried to encode as you instructed.

The solution to this problem is to just not encode line at all; it's already UTF-8 encoded, so you're already golden.
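As a rough illustration of the implicit step described above, the sketch below spells out by hand what Python 2 does when encode is called on a str: an ASCII decode followed by the requested encode. The helper name py2_style_encode and the sample bytes are made up for this example, and a bytes literal is used so the snippet behaves the same on Python 2.7 and Python 3.

```python
# UTF-8 encoded bytes for the sample input; \xc3\x9c is the letter Ü.
raw = b'Start: myUsername: my\xc3\x9csername:'


def py2_style_encode(byte_string, codec='utf-8'):
    # Hypothetical helper: mimics Python 2's str.encode by decoding with
    # the default codec (ASCII) first and only then encoding.
    return byte_string.decode('ascii').encode(codec)


try:
    py2_style_encode(raw)
    outcome = 'encoded'
except UnicodeDecodeError as exc:
    # The decode step fails on the first non-ASCII byte, before any
    # encoding happens.
    outcome = 'UnicodeDecodeError'
    failing_byte = exc.object[exc.start:exc.end]

print(outcome)
```

The failing byte here is 0xc3, the same byte reported in the traceback from the question.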

Your second problem (which you haven't encountered yet, but you will) is that you're calling encode on the group(4) result. But of course, since the input was a str, the group is a str too, and you'll encounter the same problem trying to encode a str; since the group came from raw UTF-8 encoded bytes, the non-ASCII parts of it cause a UnicodeDecodeError during the implicit decode step before the encode.

The reason:

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

works is that it (dangerously) changes the implicit decode step to use UTF-8, so all your encode calls now perform the implicit decode with UTF-8 instead of ASCII. The decode and encode are mostly pointless: all they do is return the original str after confirming it's legal UTF-8 by decoding it as such, which makes the call an expensive no-op.
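To make the no-op concrete, here is a small sketch (with made-up sample bytes) of the round trip that an encode call would effectively perform once UTF-8 is the default codec. It does not call setdefaultencoding itself; it only writes the decode and encode steps out explicitly.

```python
raw = b'my\xc3\x9csername'  # UTF-8 encoded bytes

# With UTF-8 as the default codec, str.encode('utf-8') on Python 2 would
# effectively perform this decode/encode round trip.
round_tripped = raw.decode('utf-8').encode('utf-8')

# The bytes come back unchanged; the only real effect is the validity check.
print(round_tripped == raw)
```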

To fix the second problem, just change:

m.group(4).encode()

to:

m.group(4)

That leaves your final code as:

result = re.sub(r'(Start:\s*)([^:]+)(:\s*)([^:]+)',
                lambda m: m.group(1) + m.group(2) + m.group(3) + hashlib.sha512(m.group(4)).hexdigest(),
                line)

Optionally, if you want to confirm your expectation that line is in fact UTF-8 encoded bytes already, add the following above that re.sub line:

try:
    line.decode('utf-8')
except Exception as e:
    sys.exit("line (of type {!r}) not decodable as UTF-8: {}".format(line.__class__.__name__, e))

which will cause the program to exit immediately if the data given is not legal UTF-8 (and will also let you know what type line is, so you can confirm for sure if it's really str or unicode, since str implies you chose the wrong codec, while unicode means your inputs aren't of the expected type).
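Putting the pieces of this answer together, a self-contained sketch might look like the following. It assumes the input line is a byte string holding UTF-8 data, as in the question. The helper name hash_last_field is hypothetical, and the br'' pattern plus the final encode('ascii') are additions so that the same code also runs on Python 3; on Python 2.7 the original str-based version above is sufficient.

```python
import hashlib
import re

# br'' keeps the pattern a byte string on both Python 2.7 and Python 3.
PATTERN = re.compile(br'(Start:\s*)([^:]+)(:\s*)([^:]+)')


def hash_last_field(line):
    # Hypothetical helper name. `line` must already be UTF-8 encoded bytes;
    # decoding is used only as a validity check, the result is discarded.
    line.decode('utf-8')

    def replace(m):
        # Hash the raw bytes of group 4 directly, with no encode call.
        digest = hashlib.sha512(m.group(4)).hexdigest()
        # encode('ascii') is only needed on Python 3, where hexdigest()
        # returns text; on Python 2 it is a harmless no-op for hex digits.
        return m.group(1) + m.group(2) + m.group(3) + digest.encode('ascii')

    return PATTERN.sub(replace, line)


sample = b'Start: myUsername: my\xc3\x9csername:'
result = hash_last_field(sample)
print(result)
```

Unlike the sys.exit variant, this sketch lets the UnicodeDecodeError propagate to the caller when the data is not valid UTF-8, which may be easier to handle if the function is called once per line of a file.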

ShadowRanger
0

I found what is, in my eyes, a workaround. It doesn't feel right, but it does the job.

import sys

reload(sys)
sys.setdefaultencoding('UTF8')

I thought it could be done with .encode('utf-8')

peacemaker
0
import hashlib

file = 'xyz'
res = hashlib.sha224(str(file).encode('utf-8')).hexdigest()

This is needed because a unicode object must be encoded to a byte string before it can be hashed.
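For the case this answer describes, where the value really is a unicode object rather than an already-encoded str, a minimal sketch with a made-up sample value could look like this. It encodes explicitly and then hashes the resulting bytes, and runs on both Python 2.7 and Python 3.

```python
import hashlib

text = u'my\xdcsername'  # a unicode object containing Ü

# Encode to UTF-8 bytes first, then hash the bytes.
digest = hashlib.sha224(text.encode('utf-8')).hexdigest()
print(digest)
```

Note that this does not apply to the line in the question, which was already a UTF-8 encoded str; calling encode on that is what triggered the UnicodeDecodeError.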

Park
poornesh