I'm trying to zip a folder in Python 3 with the zipfile module.

Since I'm German, some of my filenames contain umlauts (äöü).

While zipping, I get: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 95: surrogates not allowed.

The character in question is an ü.

How can I get zipfile to zip all my files?

The relevant code is this:

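import os
import zipfile

def zipdir(path, ziph):
    # Walk the tree and add every file to the already open archive.
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

if __name__ == '__main__':
    # The context manager closes the archive even if write() raises.
    with zipfile.ZipFile('path/to/destination', 'w', zipfile.ZIP_DEFLATED) as zipf:
        zipdir('path/to/folder', zipf)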

Edit:
I get the same error when I use shutil.make_archive:

import shutil

shutil.make_archive('/path/to/destination', 'zip', '/path/to/folder')

Full stacktrace of shutil.make_archive():

Traceback (most recent call last):
  File "/usr/lib64/python3.7/zipfile.py", line 452, in _encodeFilenameFlags
    return self.filename.encode('ascii'), self.flag_bits
UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 59: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 39, in <module>
    archive_dir(path, zip_fullpath)
  File "run.py", line 19, in archive_dir
    shutil.make_archive(dest, 'zip', source)
  File "/home/sean/.local/share/virtualenvs/backup-script-QUcRKrDQ/lib/python3.7/shutil.py", line 822, in make_archive
    filename = func(base_name, base_dir, **kwargs)
  File "/home/sean/.local/share/virtualenvs/backup-script-QUcRKrDQ/lib/python3.7/shutil.py", line 720, in _make_zipfile
    zf.write(path, path)
  File "/usr/lib64/python3.7/zipfile.py", line 1746, in write
    with open(filename, "rb") as src, self.open(zinfo, 'w') as dest:
  File "/usr/lib64/python3.7/zipfile.py", line 1473, in open
    return self._open_to_write(zinfo, force_zip64=force_zip64)
  File "/usr/lib64/python3.7/zipfile.py", line 1586, in _open_to_write
    self.fp.write(zinfo.FileHeader(zip64))
  File "/usr/lib64/python3.7/zipfile.py", line 442, in FileHeader
    filename, flag_bits = self._encodeFilenameFlags()
  File "/usr/lib64/python3.7/zipfile.py", line 454, in _encodeFilenameFlags
    return self.filename.encode('utf-8'), self.flag_bits | 0x800
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 59: surrogates not allowed

Full stacktrace of zipfile:

Traceback (most recent call last):
  File "/usr/lib64/python3.7/zipfile.py", line 452, in _encodeFilenameFlags
    return self.filename.encode('ascii'), self.flag_bits
UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 95: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 41, in <module>
    zipdir(path, zipf)
  File "run.py", line 16, in zipdir
    ziph.write(filepath)
  File "/usr/lib64/python3.7/zipfile.py", line 1746, in write
    with open(filename, "rb") as src, self.open(zinfo, 'w') as dest:
  File "/usr/lib64/python3.7/zipfile.py", line 1473, in open
    return self._open_to_write(zinfo, force_zip64=force_zip64)
  File "/usr/lib64/python3.7/zipfile.py", line 1586, in _open_to_write
    self.fp.write(zinfo.FileHeader(zip64))
  File "/usr/lib64/python3.7/zipfile.py", line 442, in FileHeader
    filename, flag_bits = self._encodeFilenameFlags()
  File "/usr/lib64/python3.7/zipfile.py", line 454, in _encodeFilenameFlags
    return self.filename.encode('utf-8'), self.flag_bits | 0x800
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 95: surrogates not allowed

Update:

I've tried some of the solutions from the link posted in the comments that seemed to work for others. Here is what I got.

With ziph.write(filepath.encode('utf8','surrogateescape').decode('ISO-8859-1')):

Traceback (most recent call last):
  File "run.py", line 41, in <module>
    zipdir(path, zipf)
  File "run.py", line 16, in zipdir
    ziph.write(filepath.encode('utf8','surrogateescape').decode('ISO-8859-1'))
  File "/usr/lib64/python3.7/zipfile.py", line 1713, in write
    zinfo = ZipInfo.from_file(filename, arcname)
  File "/usr/lib64/python3.7/zipfile.py", line 506, in from_file
    st = os.stat(filename)
FileNotFoundError: [Errno 2] No such file or directory: '/some/path/to/documents/DIS_Broschüre_DE.pdf'

So the encode/decode round trip produced a path that does not exist in the file system.

The other option, ziph.write(filepath.encode('utf8','surrogateescape').decode('utf-8')), gave me:

Traceback (most recent call last):
  File "run.py", line 41, in <module>
    zipdir(path, zipf)
  File "run.py", line 16, in zipdir
    ziph.write(filepath.encode('utf8','surrogateescape').decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 96: invalid start byte
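
A note on why both attempts fail, with a possible workaround. '\udcfc' is how Python represents a raw byte 0xFC that it could not decode as UTF-8 when listing the directory (the surrogateescape error handler). Re-decoding the whole path as ISO-8859-1 turns that byte into a real 'ü', but with a UTF-8 locale os.stat then encodes 'ü' as the bytes 0xC3 0xBC, which is not the name on disk, hence the FileNotFoundError. Decoding as UTF-8 fails because 0xFC was never valid UTF-8 to begin with. The sketch below keeps the original path for reading the file and only repairs the name stored in the archive. It assumes the stray bytes are ISO-8859-1, which is a guess; repair_name is a hypothetical helper, not part of zipfile.

```python
import os
import zipfile


def repair_name(name, fallback="iso-8859-1"):
    """Return a version of `name` that can be encoded as UTF-8.

    Assumption: a name that Python could not decode is entirely
    `fallback`-encoded. A component mixing valid UTF-8 with stray bytes
    would be mangled by this.
    """
    try:
        name.encode("utf-8")
        return name  # already clean, e.g. a correctly stored umlaut
    except UnicodeEncodeError:
        # Recover the raw bytes behind the lone surrogates, then re-decode.
        raw = name.encode("utf-8", "surrogateescape")
        return raw.decode(fallback)


def zipdir(path, ziph):
    for root, dirs, files in os.walk(path):
        for file in files:
            # Untouched path: this is what the OS needs to open the file.
            filepath = os.path.join(root, file)
            # Repair each component separately so clean ones stay as they are.
            arcname = "/".join(repair_name(part) for part in filepath.split(os.sep))
            ziph.write(filepath, arcname=arcname)
```

It is called exactly like the original zipdir(path, ziph). The only change is that ZipFile.write receives the on-disk path as filename and the repaired string as arcname.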
sebastian
  • What are your OS and Python version? Is your Python file encoded in UTF-8 (not sure it is relevant)? I can't reproduce your error on Linux. – ndclt Sep 28 '19 at 19:58
  • @ndclt OS is Manjaro, kernel 5.3, Python 3.7.4, and the file is encoded in UTF-8. `echo $LANG` gives `de_DE.utf8`. – sebastian Sep 28 '19 at 20:16
  • Could you add the full stacktrace not just the last line? – ndclt Sep 28 '19 at 20:28
  • @ndclt appended stacktraces. – sebastian Sep 28 '19 at 20:41
  • I think I have your answer [there](https://stackoverflow.com/a/27367173). I don't know if the [pathlib](https://docs.python.org/3/library/pathlib.html) has better behaviour. – ndclt Sep 28 '19 at 20:59
  • @ndclt Thanks for the link. I've tried some options but they don't seem to work. Which answer did you mean in particular? – sebastian Sep 28 '19 at 21:24
  • The one targeted by the link and accepted. It was the last line of the answer which was interesting: `for p,d,f in os.walk(b'.'):`. With this, I think you can try to encode with the correct format (even if your OS says it is in utf-8). – ndclt Sep 28 '19 at 21:33
  • @ndclt I've tried that too. It doesn't work at all: the for loop doesn't even start. I also tried pathlib, with the same problem. For now I've worked around all of it with `subprocess.call(['zip', '-r', 'to.zip', 'from'])`. Why is this such a problem? Shouldn't these basic things just work by now? – sebastian Sep 28 '19 at 21:44
  • Your original zipfile code seems to work for me when I set the `LC_ALL` environment variable to `en_US.UTF-8` using the command `export LC_ALL='en_US.UTF-8'` before running the script. Could you try that and see if it works? – VietHTran Sep 28 '19 at 22:25
  • @VietHTran Nope, that doesn't change anything. – sebastian Sep 28 '19 at 22:54

1 Answer

OK, I've found the problem. The files in question were not the ones I thought they were. Ordinary umlauts work fine. Some of the filenames were actually corrupt, like this:

ls in one of the dirs gives:
2e_geh�usetechnologie_flyer_qrcode.pdf

Command line auto completion gives me:
2e_geh$'\344'usetechnologie_flyer_qrcode.pdf

Since these files were uploaded via a web interface, I can only imagine that they were created on Windows or another non-UNIX OS and the web server couldn't handle the filenames.

Other uploaded files had correct umlauts. I'm not sure what happened there, but I'm glad that neither Python nor the Linux file system is to blame.
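
To find the affected files before zipping, something like the following could help. In the example above, \344 is octal for the byte 0xE4, which is 'ä' in ISO-8859-1, so the name is not valid UTF-8. This is a minimal sketch, assuming a UTF-8 locale as in the question: names that os.walk could not decode contain lone surrogates, and those names fail to encode as UTF-8. has_undecodable_bytes and find_bad_names are hypothetical helper names.

```python
import os


def has_undecodable_bytes(name):
    """Return True if `name` cannot be encoded as UTF-8.

    With a UTF-8 locale this is the case for names containing lone
    surrogates such as '\\udce4', which stand in for raw bytes.
    """
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def find_bad_names(top):
    """Yield (path, raw_bytes) for every entry below `top` with such a name."""
    for root, dirs, files in os.walk(top):
        for name in dirs + files:
            if has_undecodable_bytes(name):
                path = os.path.join(root, name)
                # os.fsencode gives back the bytes as stored on disk (POSIX).
                yield path, os.fsencode(path)
```

Looping over find_bad_names('path/to/folder') and printing the raw bytes shows which names need to be renamed or re-encoded.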

Thanks for all the tips.

sebastian