It looks like Python 3 treats regular strings as Unicode. This code:

import hashlib
h = hashlib.md5()
h.update('abcd')  # passing a str, not bytes

causes the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Unicode-objects must be encoded before hashing

forcing me to encode it first:

import hashlib
h = hashlib.md5()
h.update('abcd'.encode('ascii', 'replace'))

which is tedious, since this pattern occurs several dozen times in the program.

I was wondering if there is a way to avoid calling encode everywhere in the program.

ivan_pozdeev
Miguel Rozsas
  • `h.update (b'abcd')` – Josh Friedlander Jun 11 '22 at 21:18
  • Probably not the optimal solution. But you can use `b'abcd'` instead of the string directly. – Mohamed Yasser Jun 11 '22 at 21:18
  • Python (3.x) does not "treat" regular strings "as Unicode"; Unicode **is the standard** that explains **what strings are**. The first version of the Unicode standard came out **over thirty years ago**; expecting to work with text on a 1 character = 1 byte basis is inexcusable today. "I was wondering if there is an alternative to not use encode everywhere in the program." Write your own function to wrap up those steps together. Alternately, if it is a literal, then don't have a string in the first place; have a `bytes` object - since you apparently have specific bytes that you wish to hash. – Karl Knechtel Jun 11 '22 at 21:18
  • Thankfully, it does. https://docs.python.org/3/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit *" All text is Unicode; "* – DeepSpace Jun 11 '22 at 21:19
  • Thank you all for the help. Yes, I have to write it as b'simple string' rather than as a Unicode string, which is the norm. – Miguel Rozsas Jun 11 '22 at 21:25

2 Answers

You can define a bytes literal using the b prefix, e.g.:

import hashlib

h = hashlib.md5()
h.update(b"abcd")
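
As a quick check, the bytes literal and the explicitly encoded string should give the same digest, since encoding 'abcd' as ASCII yields the same four bytes as b"abcd". This is only an illustrative sketch; the variable names are made up for the example:

```python
import hashlib

# Hash the same four bytes two ways: via a bytes literal and via str.encode().
literal_digest = hashlib.md5(b"abcd").hexdigest()
encoded_digest = hashlib.md5("abcd".encode("ascii")).hexdigest()

print(literal_digest == encoded_digest)  # True
```

Note that a bytes literal may only contain ASCII characters, so this shortcut applies to literals like the one in the question, not to arbitrary text.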

John Kugelman
  • **If** this answers the question (and it is not quite clear to me how OP expects things to work, or the *context* of where the data comes from), then I would argue that the question is a duplicate of https://stackoverflow.com/questions/61780683/python-byte-literal. – Karl Knechtel Jun 11 '22 at 21:22
  • I could go either way with it. My thinking is that this is more specifically about how to prevent calling `.encode("ascii")` on a unicode literal. The title should probably be updated to reflect that. – Joshua Taylor Eppinette Jun 11 '22 at 21:25

In Python 3, strings are Unicode. From the data model in the documentation:

A string is a sequence of values that represent Unicode code points.

Additionally, hashlib hashes expect bytes-like objects to be used for initialization and updates.

So besides calling encode, you can use:

  • Bytes literals, e.g. b"abcd".
  • The built-in function bytes, e.g. bytes("abcd", "ascii"), in case you can't use literals.
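
If the strings are not literals and the pattern repeats throughout the program, another option is to wrap the encoding step in a small helper, so the encode call lives in one place. A minimal sketch: `update_text` is a hypothetical name, and UTF-8 as the default encoding is an assumption (the question uses ASCII, which gives the same bytes for ASCII-only text):

```python
import hashlib


def update_text(h, data, encoding="utf-8"):
    """Feed str or bytes-like data to a hashlib object, encoding str first."""
    if isinstance(data, str):
        data = data.encode(encoding)
    h.update(data)
    return h


h = hashlib.md5()
update_text(h, "abcd")   # str is encoded before hashing
update_text(h, b"efgh")  # bytes are passed through unchanged

# Successive updates are equivalent to hashing the concatenated bytes.
print(h.hexdigest() == hashlib.md5(b"abcdefgh").hexdigest())  # True
```

Call sites then become `update_text(h, some_string)` instead of repeating `.encode(...)` each time.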
Roland Smith