I have a file with a Unicode name, say 'קובץ.txt'. I want to zip it using Python's zipfile module.

I can zip the files and open them later without a problem, except that the file names are garbled when I view the archive in Windows 7's File Explorer (7-Zip displays them correctly).

According to the docs, this is a common problem, and there are instructions on how to deal with it:

From ZipFile.write

Note

There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets all file names as encoded in CP437, also known as DOS Latin.

Sorry, but I can't work out what exactly I'm supposed to do with the filename. I've tried .encode('CP437') and .decode('CP437').

A-Palgy
  • The `zipfile` module uses UTF-8 instead of cp437 for non-ASCII filenames and sets `flag_bits | 0x800` when writing the archive. UTF-8 supports the full Unicode range (ignoring lone surrogates). You can both compress and decompress the file using Python, or use the `-mcu` switch to decompress it with 7-Zip. See also [Correctly decoding zip entry file names — CP437, UTF-8 or?](http://stackoverflow.com/q/13261347/4279) – jfs Nov 28 '15 at 14:47
  • Change the title of your question so that it relates more closely to your actual task, e.g., "create a zip archive with non-ascii entries". Where does `'קובץ.txt'` come from? Is it given as a command-line argument? What is your Python version? What happens if you run `py -mzipfile -c archive.zip קובץ.txt` from the command line in a directory that contains the `קובץ.txt` file? – jfs Nov 30 '15 at 17:32

3 Answers

You'd have to encode your Unicode string to CP437. However, you can't encode your specific example because the CP437 codec does not support Hebrew:

>>> u'קובץ.txt'.encode('cp437')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>

The error above tells you that the first 4 characters (קובץ) cannot be encoded because they do not exist in the target character set. CP437 only supports the Western alphabet (A-Z, and accented characters like ç and é), IBM line-drawing characters (such as ╚ and ┤) and a few Greek symbols, mainly for math equations (such as Σ and φ).

You'll either have to generate a different filename that uses only characters supported by the CP437 codec, or live with the fact that WinZip will never be able to show Hebrew filenames properly and simply stick with the character set that did work for you with 7zip.
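The first option can be sketched as a small check that runs before a name goes into the archive. The helper names `fits_codec` and `archive_name`, and the fallback name `file.txt`, are hypothetical and only illustrate the idea; they are not part of `zipfile`.

```python
def fits_codec(name: str, codec: str = "cp437") -> bool:
    """Return True if every character of name exists in the given codec."""
    try:
        name.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def archive_name(name: str, fallback: str, codec: str = "cp437") -> str:
    """Keep name if the codec can represent it, otherwise use fallback."""
    return name if fits_codec(name, codec) else fallback


print(fits_codec("café.txt"))                 # é exists in CP437
print(fits_codec("קובץ.txt"))                 # Hebrew letters do not
print(archive_name("קובץ.txt", "file.txt"))   # falls back to the ASCII name
```

How the fallback name is chosen (transliteration, a counter, a hash) is left open here; the check itself only tells you whether a CP437-only reader could display the original name.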

Martijn Pieters
  • Thanks. I think I need to rephrase my question: I'm testing on Windows 7, and I think Windows Explorer is the program I'm using to open the file. Anyway, I'm able to create a zip file that contains Hebrew names with a program like Total Commander; I just can't create the same thing with Python's zipfile – A-Palgy Nov 26 '15 at 16:32
  • @A-Palgy: you'll need to show more information on how you do that then, like exactly what type of objects your filenames are (encoded strings, or `unicode` objects, for example). When you use Total Commander to create a zip file, what does that look like in WinZip? What codec is used for the filenames, do you know? You'll have to use the same codec from Python. – Martijn Pieters Nov 26 '15 at 16:37
  • How can I check what codec is used for the filenames? – A-Palgy Nov 30 '15 at 09:30
  • What does the raw data look like when you open that zipfile? `repr(name)` would give you a re-usable representation of the bytes (in Python 2). It should be immediately clear if UTF-8 or UTF-16 was used, for example. – Martijn Pieters Nov 30 '15 at 09:55
  • repr of the file name looks like this: `'\\'_\\xf4_\\xf5_\\xbf_\\xf0_\\xf5_\\xac _\\xa3_\\xf4_\\xf6_____\\xf4 #12.xlsx\\''` which should have been `הוראותלהזמנה #12.xlsx` Thanks anyway, I think I'll stick with 7zip for now – A-Palgy Jan 03 '16 at 10:02
  • @A-Palgy: that's not a valid `repr()`; that looks doubly quoted and the single quotes around it don't work with single quotes in the value. The `\xhh` byte count doesn't match your expected output either, I'm afraid. Last but not least, I can't find any correlation with any Hebrew-capable codecs that Python can handle out-of-the-box. – Martijn Pieters Jan 04 '16 at 10:21
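On the question of how to check which codec an archive uses for its names, one rough approach in Python 3 is to look at each entry's UTF-8 flag and at the bytes stored for the name. This is a sketch, not something from the thread: the helper `stored_names` is a hypothetical name, and it assumes `zipfile`'s default behaviour of decoding flagged names as UTF-8 and all other names as cp437.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def stored_names(archive):
    """Return (decoded_name, has_utf8_flag, raw_bytes) for every entry."""
    rows = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            flagged = bool(info.flag_bits & UTF8_FLAG)
            # Re-encoding with the codec zipfile used for decoding gives
            # back the bytes that were stored for the name.
            raw = info.filename.encode("utf-8" if flagged else "cp437")
            rows.append((info.filename, flagged, raw))
    return rows


# Demo on an in-memory archive written by zipfile itself.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("קובץ.txt", b"hello")
    zf.writestr("plain.txt", b"hello")

for name, flagged, raw in stored_names(buf):
    print(name, flagged, raw)
```

For an entry without the flag, the raw bytes can then be tried against candidate codecs (for example `raw.decode("cp862")` for Hebrew DOS text) to see which one yields a sensible name.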

Try this:

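import zipfile

# The name as a str; the bytes literal is the UTF-8 encoding of 'קובץ.txt'
p = b'\xd7\xa7\xd7\x95\xd7\x91\xd7\xa5.txt'.decode('utf8')
# or just:
# p = 'קובץ.txt'

with zipfile.ZipFile('test.zip', 'w') as z:
    # Reinterpret the UTF-8 bytes as cp437 text before passing the name on.
    # Caveat: Python 3 stores non-ASCII names as UTF-8 with flag bit 0x800 set,
    # so check the resulting archive in the tool you are targeting.
    with z.open(p.encode('utf8').decode('cp437'), 'w') as f:  # mode 'w' needs Python 3.6+
        f.write(b'hello world')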

I tested this on macOS, where I used utf8 rather than cp437 in the code above, and it works.

I hope this works on Windows too.

I've also tested reading Chinese filenames encoded as "gbk" or "gb18030" with similar code, and it works well.

When the zip archive comes from (or needs to be sent to) Mac/Linux, change cp437 in the code to utf8.

When the zip archive comes from (or needs to be sent to) Windows, leave cp437 unchanged.
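For the reading direction mentioned above, the same idea runs in reverse. A minimal sketch, assuming the entry lacks the UTF-8 flag so that `zipfile` decoded its name as cp437; `redecode` is a hypothetical helper name, and the archive entry is simulated with a plain string rather than a real file.

```python
def redecode(name: str, actual_codec: str, assumed_codec: str = "cp437") -> str:
    """Undo a wrong decode: recover the stored bytes, then decode them properly."""
    return name.encode(assumed_codec).decode(actual_codec)


# Simulate what zipfile reports for a GBK-encoded name without the UTF-8 flag.
raw = "中文.txt".encode("gbk")
as_seen = raw.decode("cp437")    # mojibake, as it would appear in namelist()

print(redecode(as_seen, "gbk"))  # the original Chinese name
```

With a real archive, `as_seen` would be an item from `ZipFile.namelist()`, and `actual_codec` is whatever the creating tool used, which has to be known or guessed.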

cdarlint

For CP866 (Russian) this works:

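    from zipfile import ZipFile, ZipInfo

    class ZipInf(ZipInfo):
        def __init__(self, filename):
            super().__init__(filename)
            self.create_system = 0  # 0 marks the entry as created on MS-DOS/Windows

        # Overrides a private ZipInfo hook, so this may break in other Python versions
        def _encodeFilenameFlags(self):
            # Store the name as cp866 bytes and leave the UTF-8 flag (0x800) unset
            return self.filename.encode('cp866'), self.flag_bits

    with ZipFile('ex.zip', 'w') as zipf:
        zipf.writestr(ZipInf('Файл'), '123456789'*1024)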

This stores directory and file names in the archive encoded as cp866 (the example contains only the file 'Файл').