Create File In Linux With Unicode File Name

Question

I wrote a Python script that reads an email file using the "email" module, extracts its attachments to the file system, zips the extracted files, and emails the Zip file to someone.

The attachments may have Unicode file names, for example in Chinese or Japanese. I found that the function "email.header.decode_header" can retrieve the file name and its encoding. For example:

decode_header(payload.get_filename())

will produce:

[('2015\xe5\xb9\xb4\xe6\xb5\x81\xe5\xb9\xb4_Test.pages', 'utf-8')]

where the file name is encoded in UTF-8. Or

[('\x1b$B%Q%=%3%s;q;:4IM}BfD"!J\x1b(BS&T HK\x1b$B!K\x1b(B_\x1b$B8=COD4C#\x1b(BPC.xls', 'iso-2022-jp')]

which contains Japanese characters.
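For illustration, the pairs returned by decode_header can be joined into a single text string along these lines. This is only a minimal sketch, assuming Python 3; the helper name decode_filename is hypothetical and not part of the "email" module.

```python
from email.header import decode_header


def decode_filename(raw_name):
    """Return the attachment name from the header as one text string."""
    parts = []
    for value, charset in decode_header(raw_name):
        if isinstance(value, bytes):
            # Encoded parts arrive as bytes plus the charset named in the
            # header; fall back to ASCII when no charset is given.
            value = value.decode(charset or "ascii", errors="replace")
        parts.append(value)
    return "".join(parts)
```

Calling decode_filename(payload.get_filename()) would then give a text string that can be re-encoded as needed, whichever charset the sender used.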

In the script I convert the file name to UTF-8, save the file to the file system (Linux), create a Zip file, and send the Zip file via email. When the user retrieves and extracts the Zip file on Windows, the file names in the Zip file turn into garbage.

I searched Google and Stack Overflow and found that the Windows file system stores file names as Unicode (UTF-16) rather than UTF-8. So I can open the Zip file without problems on macOS, but not on Windows. I also tried changing the script to name the file with a Unicode string:

filename = unicode('\x1b$B%Q%=%3%s;q;:4IM}BfD"!J\x1b(BS&T HK\x1b$B!K\x1b(B_\x1b$B8=COD4C#\x1b(BPC.xls', 'iso-2022-jp')
f = open(filename, 'wb')
....

I can create the file without problems when I run the above commands in the Python shell. However, when I put exactly the same commands into my script, the error

UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-6: ordinal not in range(128)

is displayed.

Can anyone suggest how to solve this problem, so that I can create a Zip file that opens on Windows with the correct file names?

Related to http://stackoverflow.com/q/13261347/291641 Basically the zip format doesn't really support the filename encoding. Possibly using 7z or another archive format might handle this better. 7z is supposed to support unicode filenames. — patthoyts, Jun 05 '15 at 14:13
I tried opening the Zip file with WinRAR and 7z but still cannot get the correct file names. I think it is because Windows cannot recognize UTF-8 file names. — user2114189, Jun 05 '15 at 14:17
The zip file format doesn't really support unicode. The 7z format does. Try using 7z archive files -- not 7zip to unzip zip files. — patthoyts, Jun 05 '15 at 14:56
I archive the files with the "pyminizip" module in my Python script on a Linux server. The script is executed automatically when the Linux server receives an email through its Postfix service, so I cannot archive the files manually using 7z. — user2114189, Jun 05 '15 at 15:08
FYI, I can create a .zip file with Unicode names on Windows with WinZip 18.0, and Python 2.7's and 3.3's `zipfile` module can read the name correctly, so it looks like it should be possible, just not with the tools you are using. — Mark Tolonen, Jun 05 '15 at 16:44

score 0 · Answer 1 · answered Jun 05 '15 at 15:14

I am afraid there is no nice solution. AFAIK, the name of a file in a ZIP archive is an 8-bit string. So you have to encode it in the encoding of your choice, and UTF-8 will be correctly understood on Linux and Mac.

On Windows, the best you could do would be to translate the zip file:

extract all the files
for each file
- take its name as an 8-bit string
- decode the name as UTF-8 into a Unicode string
- encode it as an 8-bit string with the native encoding on Windows (generally windows-1252 for Western European languages)
- rename the file with that new name
build a new zip with the renamed files
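The per-file renaming step in the list above could be sketched roughly as follows. This is only an illustration, assuming Python 3; the function name reencode_zip_name is hypothetical, and the target code page has to match the recipient's Windows installation (cp1252 for Western European systems, cp932 is assumed here for Japanese ones).

```python
def reencode_zip_name(name_bytes, target_encoding="cp1252"):
    """Convert a UTF-8 encoded file name to a Windows code page.

    Raises UnicodeEncodeError when the name contains characters that the
    target code page cannot represent.
    """
    text = name_bytes.decode("utf-8")      # 8-bit string -> Unicode string
    return text.encode(target_encoding)    # Unicode string -> native 8-bit string


# Hypothetical usage: a name taken from the question, re-encoded for a
# Japanese Windows code page (an assumption about the recipient's system).
japanese_name = reencode_zip_name(b"2015\xe5\xb9\xb4\xe6\xb5\x81\xe5\xb9\xb4_Test.pages", "cp932")
```

The weakness of this approach is visible in the sketch: a single 8-bit code page cannot hold every name, so a Chinese or Japanese name re-encoded with the default cp1252 raises UnicodeEncodeError.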

Alternatively, try another archive format, as the 7z format is supposed to accept Unicode names.
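As a comment on the question points out, Python's own zipfile module may also be worth a try. The following is a minimal sketch, assuming Python 3 and a made-up archive content; the helper zip_with_unicode_names is hypothetical. For a name that is not pure ASCII, zipfile stores the name as UTF-8 and sets general-purpose flag bit 11 (0x800) to mark it as such. Whether a particular Windows extractor honors that flag is not something this sketch verifies.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11: the file name is UTF-8


def zip_with_unicode_names(files):
    """Build a zip archive in memory from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            # zipfile encodes non-ASCII names as UTF-8 and sets the flag.
            archive.writestr(name, data)
    return buffer.getvalue()


def names_and_flags(zip_bytes):
    """Return (name, has_utf8_flag) for every entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [(info.filename, bool(info.flag_bits & UTF8_NAME_FLAG))
                for info in archive.infolist()]
```

Reading the archive back with names_and_flags is a quick way to check that the name survived and that the UTF-8 flag was written.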

score 0 · Accepted Answer · answered Jun 11 '15 at 02:04

Finally, I found out why I could create the file successfully in the Python interpreter but not in my Python script. The LANG environment variable is "en_US.UTF8" in the Python interpreter but "C" when my Python script runs.

import os
print os.environ['LANG']

I think this is why creating a file with a Chinese or Japanese file name in my Python script produces errors: with LANG set to "C", Python falls back to the ASCII codec for file names.

I tried running my Python script with:

env LANG=en_US.UTF8 myscript.py

to change LANG to a UTF-8 locale. The problem was solved.
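If changing the environment of the script is not possible, another option may be to encode the file name explicitly, so that Python does not have to derive an encoding from LANG. This is a minimal sketch, assuming Python 3 on Linux; the helper create_file_utf8_name is hypothetical. In Python 2 the rough equivalent would be open(filename.encode('utf-8'), 'wb').

```python
import os
import sys


def create_file_utf8_name(directory, name, data):
    """Create `name` in `directory`, passing the path to open() as bytes.

    The name is encoded to UTF-8 explicitly, so the locale is not consulted
    to turn the text name into bytes.
    """
    path = os.path.join(os.fsencode(directory), name.encode("utf-8"))
    with open(path, "wb") as handle:
        handle.write(data)
    return path


# Show what the running process inherited; useful when comparing an
# interactive shell with a script started by a service such as Postfix.
print(os.environ.get("LANG"), sys.getfilesystemencoding())
```

Printing both values from inside the script and from the interactive shell makes the kind of difference described above easy to spot.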
