1

I believe this is a common issue when it comes to the default encoding of characters on Linux and Windows. However after I searched the internet I have not got any easy way to fix it automatically and therefore I am about to write a script to do it.

Here is the scenario:

I created some files on Windows system, some with non-English names (Chinese specifically in my case). And I compressed them into a zip file using 7-zip. After that I downloaded the zip file to a Linux and extract the files on the Linux system (Ubuntu 16.04 LTS) (the default archive program). As much as I have guessed, all the non-English file names are now displayed as some corrupted characters! At first I thought it should be easy with convmv, but...

I tried convmv, and it says:"Skipping, already utf8". Nothing got changed.

And then I decided to write a tool using Python to do the dirty job, after some testing I come to a point where I cannot associate the original file names to the corrupted file names, (unless by hashing the contents.)

Here is an example. I setup a webserver to list the file names on Windows, and one file, after encoded with "gbk" in python, is displayed as

u'j\u63a5\u53e3\u6587\u6863'

And I can query the file names on my Linux system. I can create a file directly with the name as shown above, and the name is CORRECT. I can also encode the unicode gbk string to utf8 encoding and create a file, the name is also CORRECT. (Thus I am not able to do them at the same time since they are indeed the same name). Now when I read the file name which I extracted earlier, which should be the same file. BUT the file name is completely different as:

'j\xe2\x95\x9c\xe2\x95\x99.....'

decoding it with utf8, it is something like u'j\u255c\u2559...'. decoding it with gbk resulted in UnicodeDecodeError exception, and I also tried to decode it with utf8 and then encode with gbk, but the result is still something else.

To summarize it, I cannot inspect the original file name by decoding or encoding it after it was extracted to the linux system. If I really want to let a program do the job, I have to either re-do the archive with possibly some encoding options maybe, or just go with my script but using file content hash (like md5 or sha1) to determine its original file name on Windows.

Do I still got any chance to infer the original name from a python script in above case other than comparing file contents between two systems?

Barry Michael Doyle
  • 9,333
  • 30
  • 83
  • 143
Qianqian
  • 2,104
  • 18
  • 28

1 Answers1

3

With a little experimentation with common encodings, I was able to reverse your mojibake:

bad = 'j\xe2\x95\x9c\xe2\x95\x99\xe2\x94\x90\xe2\x94\x8c\xe2\x95\xac\xe2\x94\x80\xe2\x95\xa1\xe2\x95\xa1'
>>> good = bad.decode('utf8').encode('cp437').decode('gbk')
>>> good
u'j\u63a5\u53e3\u6587\u6863'        # u'j接口文档'

gbk - common Chinese Windows encoding
cp437 - common US Windows OEM console encoding
utf8 - common Linux encoding

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251