Windows file names are displayed as corrupted characters on Linux

Question

I believe this is a common issue caused by the different default character encodings on Linux and Windows. However, after searching the internet I have not found an easy way to fix it automatically, so I am about to write a script to do it.

Here is the scenario:

I created some files on a Windows system, some with non-English names (Chinese, in my case), and compressed them into a zip file using 7-Zip. I then downloaded the zip file to a Linux system (Ubuntu 16.04 LTS) and extracted the files there with the default archive program. As I had guessed, all the non-English file names are now displayed as corrupted characters. At first I thought this would be easy to fix with convmv, but...

I tried convmv, and it says: "Skipping, already utf8". Nothing was changed.

I then decided to write a tool in Python to do the dirty job. After some testing I reached a point where I cannot associate the original file names with the corrupted file names (except by hashing the contents).

Here is an example. I set up a web server to list the file names on Windows, and one file name, after being decoded with "gbk" in Python, is displayed as

u'j\u63a5\u53e3\u6587\u6863'

I can also query the file names on my Linux system. I can create a file directly with the name shown above, and the name is CORRECT. I can also encode that unicode string as utf8 and create a file, and the name is also CORRECT. (I cannot create both at the same time, since they are in fact the same name.) But when I read the name of the file I extracted earlier, which should be the same file, the name is completely different:

'j\xe2\x95\x9c\xe2\x95\x99.....'

Decoding it with utf8 gives something like u'j\u255c\u2559...'. Decoding it with gbk raises a UnicodeDecodeError exception. I also tried decoding it with utf8 and then encoding with gbk, but the result is still something else.

To summarize, I cannot recover the original file name by decoding or encoding the name after extraction on the Linux system. If I really want a program to do the job, I have to either redo the archive, possibly with some encoding option, or go ahead with my script and use a hash of the file contents (such as md5 or sha1) to determine each file's original name on Windows.

Is there still any way to infer the original name from a Python script in the case above, other than comparing file contents between the two systems?

Dupe of other questions: http://stackoverflow.com/questions/9974779/using-unicode-characters-for-file-names-inside-a-zip-archive — selbie, Feb 18 '17 at 09:45
Do an Internet search for "zip file and unicode filenames". You aren't the first to hit this. — selbie, Feb 18 '17 at 09:48
`u'j\u63a5\u53e3\u6587\u6863'` yields: `'j接口文档'`. Is that correct? — Alastair McCormack, Feb 18 '17 at 10:53
Also, are you creating 7zip format archives (.7z) or PKWARE Zip archives (.zip)? — Alastair McCormack, Feb 18 '17 at 11:24
@AlastairMcCormack That was correct and the format is zip, I will follow the link selbie provided. — Qianqian, Feb 18 '17 at 11:35
@selbie, Thank you. I indeed did not search with these keywords. My bad. – Qianqian, Feb 18 '17 at 11:35
When you unzip on Linux, what's your locale? E.g. Run `locale` — Alastair McCormack, Feb 18 '17 at 11:36
@AlastairMcCormack, I actually did not specify the locale. I assume the default was used, which was an English locale. – Qianqian, Feb 18 '17 at 11:38
Your Linux session has a locale, which will be set by Ubuntu during install. Run `locale` on the command line to see what it is. — Alastair McCormack, Feb 18 '17 at 11:39
@AlastairMcCormack, Surely I will. I will do it as soon as I have access to that ubuntu PC. — Qianqian, Feb 18 '17 at 11:40

score 3 · Accepted Answer · answered Feb 18 '17 at 13:59

With a little experimentation with common encodings, I was able to reverse your mojibake (Python 2 session):

>>> bad = 'j\xe2\x95\x9c\xe2\x95\x99\xe2\x94\x90\xe2\x94\x8c\xe2\x95\xac\xe2\x94\x80\xe2\x95\xa1\xe2\x95\xa1'
>>> good = bad.decode('utf8').encode('cp437').decode('gbk')
>>> good
u'j\u63a5\u53e3\u6587\u6863'        # u'j接口文档'
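Since the question was about fixing names already on disk, here is a minimal Python 3 sketch of how the same round trip might be applied to a directory tree. In Python 3 `os.walk` already returns names decoded from UTF-8, so only the `encode('cp437').decode('gbk')` half is needed. The helper names `fix_name` and `rename_tree` are hypothetical, and the GBK default is an assumption taken from the question. Because a legitimate name made of cp437 characters can occasionally also decode as GBK, the sketch defaults to a dry run that only reports what it would rename.

```python
import os
from typing import List, Optional, Tuple


def fix_name(name: str, legacy: str = "gbk") -> Optional[str]:
    """Undo the cp437-for-GBK mojibake; return None if it does not apply."""
    try:
        fixed = name.encode("cp437").decode(legacy)
    except UnicodeError:
        return None  # not representable in cp437, or not valid in `legacy`
    return fixed if fixed != name else None  # plain ASCII names come back unchanged


def rename_tree(root: str, dry_run: bool = True) -> List[Tuple[str, str]]:
    """Walk `root` bottom-up and rename entries whose names can be repaired."""
    changes = []
    # topdown=False visits children first, so renaming a directory
    # never invalidates a path that is still waiting to be processed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            fixed = fix_name(name)
            if fixed is None:
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, fixed)
            if os.path.exists(dst):
                continue  # never overwrite an existing entry
            changes.append((src, dst))
            if not dry_run:
                os.rename(src, dst)
    return changes
```

Calling `rename_tree(path)` first and reviewing the returned pairs before rerunning with `dry_run=False` seems the safer workflow.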

gbk - common Chinese Windows encoding
cp437 - common US Windows OEM console encoding
utf8 - common Linux encoding
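A likely reason cp437 shows up here: the ZIP format treats member names as cp437 unless the archive sets its UTF-8 flag (general-purpose bit 11), so an extractor that follows this rule reads GBK bytes as cp437 and then writes the result out as UTF-8. Python's `zipfile` module applies the same rule, which means the names can also be repaired while reading the archive instead of afterwards. The sketch below is Python 3; `recover_name` and `list_recovered` are hypothetical helper names, and GBK as the original encoding is an assumption based on the question.

```python
import zipfile
from typing import List

UTF8_FLAG = 0x800  # general-purpose bit 11: member name is stored as UTF-8


def recover_name(info: zipfile.ZipInfo, legacy: str = "gbk") -> str:
    """Return the member name, re-decoding it when the UTF-8 flag is absent."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded correctly as UTF-8
    try:
        # zipfile decoded the raw bytes as cp437, so encoding restores them
        return info.filename.encode("cp437").decode(legacy)
    except UnicodeError:
        return info.filename  # fall back to the name zipfile reported


def list_recovered(zip_file, legacy: str = "gbk") -> List[str]:
    """List member names of an archive (path or file object) after repair."""
    with zipfile.ZipFile(zip_file) as zf:
        return [recover_name(info, legacy) for info in zf.infolist()]
```

On Python 3.11 and later, `zipfile.ZipFile` also accepts a `metadata_encoding` argument for reading archives, which may make the manual re-decoding unnecessary.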

Mark Tolonen


Wow, this is fantastic! I never thought of cp437 when I tried cp936. Thank you! – Qianqian Feb 19 '17 at 08:59
