There are many questions about encoding out there, but I still have not been able to solve my problem.
Imagine I have three files within a compressed ZIP file:
Übersicht.pdf
finalePräsentation
münchen
I want to unzip those files so I do:
import zipfile

with zipfile.ZipFile("path/result.zip", "r") as zip_ref:
    zip_ref.extractall("/path/")
The filenames look like crap:
My research shows that filenames are basically byte strings and that the OS has no way of knowing which encoding they use. But I was still wondering if there is any way to fix the file names so that the German "Umlaute" are displayed correctly.
I tried to change the encoding like this:
with zipfile.ZipFile(save_as, "r") as zip_ref:
    print(zip_ref.namelist())
    # Re-encode the names to inspect the raw bytes (not used for extraction)
    encoded_strings = [s.encode("utf-8") for s in zip_ref.namelist()]
    print(encoded_strings)
    zip_ref.extractall(dest)
I tried this with latin-1, iso and some other encodings, and the byte strings are in fact interpreted differently, but always cryptic. Thus I am asking the question to see if there is a simple way to fix this.
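For reference, here is a minimal sketch of one round trip that might be worth trying. It rests on an assumption I have not verified for my archive: that the names were stored as UTF-8 bytes without the ZIP "UTF-8" flag (general-purpose bit 11, 0x800), in which case Python's zipfile decodes them as CP437. The helpers repair_name and extract_with_repaired_names are hypothetical names of my own, not part of zipfile.

```python
import zipfile
from pathlib import Path

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def repair_name(name: str, flag_bits: int) -> str:
    """Undo a CP437 mis-decoding of a UTF-8 file name (hypothetical helper)."""
    if flag_bits & UTF8_FLAG:
        return name  # zipfile already decoded this name as UTF-8
    try:
        # Recover the raw bytes, then decode them as UTF-8 (an assumption).
        return name.encode("cp437").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # the assumption does not hold; keep the name as-is


def extract_with_repaired_names(zip_path: str, dest: str) -> None:
    """Extract every member under its repaired name.

    Unlike extractall(), this does not sanitize member paths, so it is
    only meant for archives from a trusted source.
    """
    dest_dir = Path(dest)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for info in zip_ref.infolist():
            target = dest_dir / repair_name(info.filename, info.flag_bits)
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zip_ref.read(info))
```

If the archive was written with a different legacy code page, the UTF-8 decode step would need to be swapped for that encoding, which is exactly the part I cannot determine from the archive itself.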
Thanks very much in advance, help is very much appreciated
EDIT: The output of locale gives me the following:
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
A hexdump of the beginning of the first file reads like this:
0000000 25 50 44 46 2d 31 2e 34 0a 25 93 8c 8b 9e 20 52
0000010 65 70 6f 72 74 4c 61 62 20 47 65 6e 65 72 61 74
0000020 65 64 20 50 44 46 20 64 6f 63 75 6d 65 6e 74 20
0000030 68 74 74 70 3a 2f 2f 77 77 77 2e 72 65 70 6f 72
echo *.pdf | xxd | head gives me this:
00000000: 6669 6e61 6c65 5072 c3a4 7365 6e74 6174 finalePr..sentat
00000010: 696f 6e2e 7064 660a ion.pdf.
00000000: 6dc3 bc6e 6368 656e 2e70 6466 0a m..nchen.pdf.
00000000: c39c 6265 7273 6963 6874 2e70 6466 0a ..bersicht.pdf.
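As a sanity check on the dump, here is a small sketch that decodes the bytes shown in the xxd output above as UTF-8. The hex strings are copied from that output with the trailing newline byte (0a) dropped; the variable names are my own.

```python
# File name bytes copied from the xxd output, without the trailing 0a.
raw_names = [
    bytes.fromhex("66696e616c655072c3a473656e746174696f6e2e706466"),
    bytes.fromhex("6dc3bc6e6368656e2e706466"),
    bytes.fromhex("c39c62657273696368742e706466"),
]

# c3 a4, c3 bc and c3 9c are the UTF-8 sequences for ä, ü and Ü.
decoded = [raw.decode("utf-8") for raw in raw_names]
print(decoded)
```

All three names decode cleanly, which suggests that these particular bytes on disk are valid UTF-8.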