Checksumming filepaths that may not be ascii

Question

Let's say I have two filepaths:

/my/file/path.mov
/mé/fileé/pathé.mov

If I do something like:

{hashlib.md5(path).hexdigest() for path in paths}

Then I'll sometimes get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 217: ordinal not in range(128)

My quickfix was something along the lines of:

{hashlib.md5(path).hexdigest() for path in paths if path.isascii()}

But what would be a better way to deal with this?

Must be Python 2 (the `u'\xc1'` gives it away). It defaults encoding to `ascii` if you don't do it yourself. Python 3 *requires* encoding to a byte string if you start with a Unicode string because you can only hash bytes. — Mark Tolonen, Aug 20 '21 at 21:16
Does this answer your question? [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) — msanford, Aug 20 '21 at 21:17

score 1 · Answer 1 · answered Aug 20 '21 at 21:08

You need to provide an encoding yourself. In full generality, you can use UTF-8.

hashlib.md5(path.encode("utf-8"))
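
As a minimal sketch of how that fits back into the loop from the question: the helper name `checksum_paths` and the sample `paths` list are hypothetical, and the sketch assumes each path is either text or already a byte string.

```python
import hashlib

def checksum_paths(paths, encoding="utf-8"):
    """Map each path to the MD5 hex digest of its encoded form.

    `checksum_paths` is a hypothetical helper, not part of hashlib.
    """
    digests = {}
    for path in paths:
        # hashlib only hashes bytes, so text is encoded explicitly;
        # values that are already bytes are passed through unchanged.
        data = path if isinstance(path, bytes) else path.encode(encoding)
        digests[path] = hashlib.md5(data).hexdigest()
    return digests

# Sample paths taken from the question, written with escapes for the accented characters.
paths = [u"/my/file/path.mov", u"/m\u00e9/file\u00e9/path\u00e9.mov"]
digests = checksum_paths(paths)
```

Encoding explicitly means no path has to be skipped, unlike the `isascii()` filter in the question.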

Silvio Mayolo

score 1 · Accepted Answer · answered Aug 20 '21 at 21:12

The encoding is missing, and you have to supply it yourself: `utf-` followed by the number of the variant you want to use (for example, 8 or 16).

Normally it should be fine like this:

hashlib.md5(path.encode("utf-8"))
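
Whichever variant you pick, it probably makes sense to use the same one everywhere, since the digest is computed over the encoded bytes. A small sketch, assuming the second path from the question as sample input:

```python
import hashlib

# Sample path from the question, written with escapes for the accented characters.
path = u"/m\u00e9/file\u00e9/path\u00e9.mov"

# The same text encoded two ways gives two different byte strings...
utf8_digest = hashlib.md5(path.encode("utf-8")).hexdigest()
utf16_digest = hashlib.md5(path.encode("utf-16")).hexdigest()

# ...so the two digests cannot be compared with each other.
digests_match = utf8_digest == utf16_digest
```

Here `digests_match` comes out false, so checksums made with one encoding will not line up with checksums made with another.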

Piero
