Fix filenames with encoding when unzipping with special characters in python

Question

There are many questions about encoding out there, but I still have not been able to solve my problem.

Imagine I have three files within a compressed ZIP file:

Übersicht.pdf
finalePräsentation
münchen

I want to unzip those files so I do:

import zipfile

with zipfile.ZipFile("path/result.zip", "r") as zip_ref:
    zip_ref.extractall("/path/")

The extracted filenames come out garbled:

[screenshot of the garbled filenames]

My research shows that filenames are basically byte strings and that it is impossible for the OS to tell what the encoding is. But I was still wondering if there is any way to rectify the problem with the file names so the German "Umlaute" are displayed correctly.

I tried to change the encoding like this:

    with zipfile.ZipFile(save_as, "r") as zip_ref:
        print(zip_ref.namelist())
        encoded_strings = [s.encode("utf-8") for s in zip_ref.namelist()]
        print(encoded_strings)
        zip_ref.extractall(dest)

I tried this with latin-1, iso and some other encodings and the byte-strings are in fact interpreted differently, but always cryptic. Thus I am asking the question to see if there is a simple way to fix this.

Thanks very much in advance, help is very much appreciated

EDIT: The output of locale gives me the following:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

hexdump of the beginning of the first file reads like this:

0000000 25 50 44 46 2d 31 2e 34 0a 25 93 8c 8b 9e 20 52
0000010 65 70 6f 72 74 4c 61 62 20 47 65 6e 65 72 61 74
0000020 65 64 20 50 44 46 20 64 6f 63 75 6d 65 6e 74 20
0000030 68 74 74 70 3a 2f 2f 77 77 77 2e 72 65 70 6f 72

echo *.pdf | xxd | head gives me this:

00000000: 6669 6e61 6c65 5072 c3a4 7365 6e74 6174  finalePr..sentat
00000010: 696f 6e2e 7064 660a                      ion.pdf.

00000000: 6dc3 bc6e 6368 656e 2e70 6466 0a         m..nchen.pdf.

00000000: c39c 6265 7273 6963 6874 2e70 6466 0a    ..bersicht.pdf.

Please [edit] to provide a hex dump of the actual bytes in these file names. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors — tripleee, Sep 24 '21 at 10:25
As requested I updated the question. I compressed the files on my machine (macOS) — shadow, Sep 24 '21 at 10:52
I don't see any update to provide what I requested, though you seem to have updated to provide the output from `locale` (with the wrong command name). We can't tell what bytes are in the file names so there is no way to provide an answer which is correct. — tripleee, Sep 24 '21 at 10:56
Perhaps see also the guidance in the [`character-encoding` tag info page](/tags/character-encoding/info) — tripleee, Sep 24 '21 at 11:25
Apologies and thank you for the corrections and resources. I now updated the question with hopefully the correct information — shadow, Sep 24 '21 at 11:30
The file name, not the contents of the file. `echo *.pdf | xxd | head` — tripleee, Sep 24 '21 at 11:38
Thanks for the command, that saved me a lot of time. I updated the question with the information. — shadow, Sep 24 '21 at 11:47
The unaccented character before the random binary garbage in the screenshot threw me off, did you edit those file names by hand? — tripleee, Sep 24 '21 at 12:08

tripleee · Accepted Answer · 2021-09-24T12:06:12.940

Thanks for the hex dump. With the updated data, the file names look like completely run-of-the-mill mojibake, probably involving code page 1252.

destination_file = filename.encode('cp1252').decode('utf-8')
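To apply that one-liner during extraction, something like the following might work. It is only a sketch: `fix_name` and `extract_with_fixed_names` are names made up for this example, and the codec has to be passed in because it is a guess. One thing to keep in mind is that CPython's `zipfile` decodes names as cp437 when the archive does not set the UTF-8 flag (bit 0x800), so if cp1252 leaves the names unchanged, cp437 may be the one to try.

```python
import zipfile


def fix_name(name, wrong_codec):
    """Undo mojibake: get the raw bytes back via the wrongly used codec, then decode as UTF-8."""
    try:
        return name.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The name does not round-trip, so it is not this kind of mojibake; leave it alone.
        return name


def extract_with_fixed_names(zip_path, dest, wrong_codec):
    """Extract all members, repairing names that were not flagged as UTF-8 in the archive."""
    fixed = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for info in zip_ref.infolist():
            # Bit 0x800 marks names stored as UTF-8; zipfile already decodes those correctly.
            if not info.flag_bits & 0x800:
                # Only the output path changes; the stored name is kept in info.orig_filename.
                info.filename = fix_name(info.filename, wrong_codec)
            zip_ref.extract(info, dest)
            fixed.append(info.filename)
    return fixed
```

Usage would be along the lines of `extract_with_fixed_names("path/result.zip", "/path/", "cp1252")`, or with `"cp437"` as the last argument. Names that cannot be repaired with the given codec are extracted unchanged rather than raising.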

My original speculation from before you updated your question is preserved below as possibly interesting and / or instructive.

Your screen shots are a bit muddy, but it looks vaguely like the file names are encoded as Windows code page 437.

>>> import unicodedata
>>> unicodedata.normalize('NFKD', "Übersicht.pdf").encode('utf-8')
b'U\xcc\x88bersicht.pdf'

Character code 0xcc translates to the glyph ╠ (U+2560) in the encodings cp1125, cp437, cp720, cp737, cp775, cp850, cp852, cp855, cp856, cp857, cp858, cp860, cp861, cp862, cp863, cp865, cp866, and cp869; and 0x88 translates to ê (U+00EA) in cp437, cp720, cp850, cp857, cp858, cp860, cp861, cp863, and cp865. There are multiple encodings in the intersection, but 437 was by far the most common back in the days when PKZIP was invented.

(╠ is double-stroked, whereas your screen shot looks more like a single-stroked version, but this might be just a matter of font design and/or an unclear picture; and the conclusion is compelling enough that I'm going with this.)
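This kind of lookup can also be done in Python instead of by hand. The sketch below checks a shortlist of codecs for ones that decode both byte values to the expected glyphs. `codecs_matching` and `CANDIDATES` are hypothetical names for this example, and the shortlist is arbitrary rather than exhaustive.

```python
def codecs_matching(expected, candidates):
    """Return the candidate codecs that decode every byte in `expected` to the given character."""
    matches = []
    for codec in candidates:
        try:
            if all(bytes([byte]).decode(codec) == char for byte, char in expected.items()):
                matches.append(codec)
        except UnicodeDecodeError:
            # Some code pages leave certain byte values undefined.
            continue
    return matches


# Hypothetical shortlist; extend it with any codec names Python knows.
CANDIDATES = ["cp437", "cp850", "cp852", "cp866", "cp1252", "latin-1"]

# 0xcc should come out as U+2560 and 0x88 as U+00EA.
matches = codecs_matching({0xCC: "\u2560", 0x88: "\u00ea"}, CANDIDATES)
print(matches)
```

A codec appearing in the result only means it is consistent with these two bytes, not that it is the one that was actually used.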

(Disclosure: the links are to a page of my own.)

Assuming this analysis is correct, and assuming the zip library gives you the names as byte strings, you should be able to simply decode them with

destination_file = filename.encode('latin-1').decode('cp437')

The detour over Latin-1 obscurely translates each character code to the corresponding byte value (recall that Latin-1 is compatible with Unicode in the first 256 characters, but is a pure 8-bit character encoding) and so we can then map it back to Unicode by decoding it with the correct codec.
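To make the Latin-1 detour concrete, here is a small sketch. `redecode` is a name invented for this example. It also shows why the trick only works when every character in the name is below U+0100, which is the failure mode reported in the comments.

```python
# Every byte value 0-255 decodes to the code point with the same number, and back.
raw = bytes(range(256))
text = raw.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))
assert text.encode("latin-1") == raw


def redecode(name, actual_codec):
    """Recover the raw bytes via Latin-1, then decode them with the codec believed to be correct."""
    return name.encode("latin-1").decode(actual_codec)


# Code points 0xcc and 0x88 become the bytes cc 88, which cp437 reads as the two glyphs discussed above.
recovered = redecode("\xcc\x88", "cp437")

# A character above U+00FF has no Latin-1 byte, so the encode step raises.
try:
    redecode("\u2560", "cp437")
    failed = False
except UnicodeEncodeError:
    failed = True
```

If the encode step raises like this, the names were not decoded as Latin-1 in the first place, and the encode step needs the codec that was actually used.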

Thanks so much for this detailed answer. This encoding throws me `UnicodeEncodeError: 'latin-1' codec can't encode characters in position 31-32: ordinal not in range(256)` — shadow, Sep 24 '21 at 12:03
I had to guess what bytes you were showing us and the guess was slightly off. See update now. — tripleee, Sep 24 '21 at 12:04

score 1 · Answer 2 · answered Sep 24 '21 at 10:37

If you can't find the original encoding, you can always fall back to ASCII with:

[unicodedata.normalize('NFKD', s).encode('ascii', 'ignore') for s in zip_ref.namelist()]

using the built-in lib unicodedata
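As a slightly fuller sketch of the same idea: the list comprehension above yields `bytes` objects, so this version decodes back to `str` to get usable file names. `to_ascii` is a made-up helper name, and the sample names are just for illustration.

```python
import unicodedata


def to_ascii(name):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


names = ["Übersicht.pdf", "finalePräsentation.pdf", "münchen.pdf"]
ascii_names = [to_ascii(n) for n in names]
```

Note that this is lossy. Characters without a decomposition, such as ß, are dropped entirely, and it only helps if the names were decoded correctly to begin with.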

Hugo-C


Thanks @Hugo-C! That works great, transforming the Umlaute to normalized names. But let's say those strings are all in UTF-8, would there be a way to actually get the Umlaute in the files? – shadow Sep 24 '21 at 10:50
if you know the encoding you can use encode/decode: `b'\xc3\xa9\xc3\xa9\xc3\xa9'.decode(encoding='utf-8', errors='strict') == 'ééé'` – Hugo-C Sep 24 '21 at 11:51
