How can Python check if a file name is in UTF8?

Question

I have a PHP script that creates a list of files in a directory. However, PHP can see only file names in English and totally ignores file names in other languages, such as Russian or Asian languages.

After a lot of effort, I found the only solution that could work for me: using a Python script that renames the files to UTF8, so the PHP script can process them after that.

(After PHP has finished processing the files, I rename the files to English; I don't keep them in UTF8.)

I used the following Python script, which works fine:

import sys
import os
import glob
import ntpath
from random import randint

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      infile_utf8 = infile.encode('utf8')
      os.rename(infile, infile_utf8)

The problem is that it also converts file names that are already in UTF8. I need a way to skip the conversion when the file name is already in UTF8.

I tried this Python script:

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        infile.decode('UTF-8', 'strict')
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

But if the file name is already in UTF8, I get a fatal error:

UnicodeDecodeError: 'ascii' codec can't decode characters in position 18-20
ordinal not in range(128)

I also tried another way, which also didn't work:

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        tmpstr = str(infile)
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

I got exactly the same error as before.

Any ideas?

Python is very new to me, and it is a huge effort for me to debug even a simple script, so please write an explicit answer (i.e. code). I don't have the ability of testing general ideas that maybe work or maybe not. Thanks.

Examples of file names:

 hello.txt
 你好.txt
 안녕하세요.html
 chào.doc

Another alternative would be to use the `unidecode` module and convert the unicode filenames to good-enough ASCII. — Blender, Oct 02 '13 at 01:13
`UnicodeDecodeError: 'ascii'` is not what you're thinking. This means something is trying to be decoded as `'ascii'` not utf8. — monkut, Oct 02 '13 at 01:46
Show the entire error trace. There's only one explicit `decode` but it doesn't look like the error is coming from there. — Mark Ransom, Oct 02 '13 at 02:17
Looks like this was answered a few times. http://stackoverflow.com/questions/6707657/python-detect-charset-and-convert-to-utf-8 — Ajay, Oct 02 '13 at 02:50
The error is exactly what I wrote. If you run this code and use a file name in an Asian language, Russian or Arabic as input, you will get exactly the same error. — Phyton_user, Oct 03 '13 at 02:10

score 4 · Accepted Answer · edited Jun 20 '20 at 09:12

I think you're confusing your terminology and making some wrong assumptions. AFAIK, PHP can open filenames of any encoding type - PHP is very much agnostic about encoding types.

You haven't been clear exactly what you want to achieve as UTF-8 != English and the example foreign filenames could be encoded in a number of ways but never in ASCII English! Can you explain what you think an existing UTF-8 file looks like and what a non-UTF-8 file is?

To add to your confusion, under Windows, filenames are transparently stored as UTF-16. Therefore, you should not try to encode filenames to UTF-8. Instead, you should use Unicode strings and allow Python to work out the proper conversion. (Don't encode in UTF-16 either!)
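As an illustration of letting Python handle the conversion, here is a minimal Python 3 sketch (the thread's own code is Python 2). The helper name `list_file_names` is hypothetical. It relies only on the fact that `os.listdir` returns text names when it is given a text path, so no explicit encode or decode call is needed.

```python
import os
import tempfile


def list_file_names(directory: str) -> list:
    """Return the names of regular files in `directory` as text (str)."""
    # A str path makes os.listdir return str names; Python converts to and
    # from the operating system's native representation on its own.
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


if __name__ == "__main__":
    # Demo in a throwaway directory so no real files are touched.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("hello.txt", "你好.txt"):
            open(os.path.join(demo_dir, name), "w").close()
        for name in list_file_names(demo_dir):
            print(repr(name))
```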

Please clarify your question further.

Update:

I now understand your problem with PHP. http://evertpot.com/filesystem-encoding-and-php/ tells us that non-latin characters are troublesome with PHP+Windows. It would seem that only files that are made of Windows 1252 character set characters can be seen and opened.

The challenge you have is to convert your filenames to be Windows 1252 compatible. As you've stated in your question, it would be ideal not to rename files that are already compatible. I've reworked your attempt as:

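import os
from glob import glob
import shutil
import urllib

# Python 2 code. Copies go into a "fixed" sub-directory; create it if needed.
if not os.path.isdir("fixed"):
    os.makedirs("fixed")

# A unicode pattern makes glob return unicode file names.
files = glob(u'*.txt')
for my_file in files:
    # The console may not be able to display the name, so fall back to escapes.
    try:
        print "File %s" % my_file
    except UnicodeEncodeError:
        print "File (escaped): %s" % my_file.encode("unicode_escape")
    new_name = my_file
    try:
        # Names that fit in Windows-1252 are kept as they are.
        my_file.encode("cp1252", "strict")
        print "    Name unchanged. Copying anyway"
    except UnicodeEncodeError:
        # Otherwise percent-encode the UTF-8 bytes of the name.
        print "    Can not convert to cp1252"
        utf_8_name = my_file.encode("UTF-8")
        new_name = urllib.quote(utf_8_name)
        print "    New name: (%% encoded): %s" % new_name

    shutil.copy2(my_file, os.path.join("fixed", new_name))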

Breakdown:

1. Print the filename. By default, the Windows shell only shows output in a local DOS code page. For example, my shell can show ü.txt, but €.txt shows as ?.txt. Therefore, you need to guard against Python raising an exception because it can't print a name. This code attempts to print the Unicode version and falls back to printing Unicode code point escapes.
2. Try to encode the string as Windows-1252. If this works, the filename is OK.
3. Otherwise, convert the filename to UTF-8, then percent-encode it. This way, the filename remains unique and you could reverse the procedure in PHP.
4. Copy the file to its new/verified name.

For example, 你好.txt becomes %E4%BD%A0%E5%A5%BD.txt
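For readers on Python 3, the same check and percent-encoding step could look roughly like the sketch below. The function names `php_safe_name` and `original_name` are hypothetical, and `urllib.parse.quote` replaces the Python 2 `urllib.quote` used in the script; this only illustrates the idea and is not a drop-in replacement.

```python
from urllib.parse import quote, unquote


def php_safe_name(name: str) -> str:
    """Return `name` unchanged if it fits in cp1252, else percent-encode it."""
    try:
        name.encode("cp1252", "strict")
    except UnicodeEncodeError:
        # quote() encodes the text as UTF-8 and escapes each byte as %XX.
        return quote(name, safe="")
    return name


def original_name(encoded: str) -> str:
    """Reverse the percent-encoding applied by php_safe_name."""
    return unquote(encoded)


if __name__ == "__main__":
    for name in ("hello.txt", "你好.txt", "안녕하세요.html", "chào.doc"):
        print(repr(name), "->", php_safe_name(name))
```

One caveat of this sketch: a cp1252-compatible name that already contains a literal `%XX` sequence is left unchanged by `php_safe_name`, so decoding it afterwards would alter it. Track which names were actually encoded if that matters.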

What I want to do is very basic: read a file name using PHP. However, PHP on Windows cannot see file names in foreign languages such as 你好.txt, 안녕하세요.html, chào.doc. You can try it yourself by creating such files and trying to list them using PHP (glob, scandir, etc.). If you succeed, please let me know how you did it. — Phyton_user, Oct 08 '13 at 04:39
I've done some research and now understand what your restrictions are. Please see the update to my answer. — Alastair McCormack, Oct 08 '13 at 11:39
Your method is very nice, but so far I could not use it because it often creates huge file names that windows cannot write/read (more than 255 characters). Any other ideas? — Phyton_user, Oct 28 '13 at 18:42
@Tom, I'm really disappointed you unaccepted my answer. I put a lot of time and effort in answering your question with a solution. If your requirements have changed then you ought to create a new question. Some initial thoughts are: use short paths and gzip+base64 encode the filename or if the original filename is not important you could hash the filename. — Alastair McCormack, Oct 28 '13 at 21:12
You are right. Your solution is the best so far and it solves most of the problem. I gave you the credit back. Well done and many thanks. Regardless - this programming problem needs more development. Maybe I will continue it in a new question, as you suggested. — Phyton_user, Oct 29 '13 at 08:27
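
The hashing suggestion from the comments above, for cases where percent-encoded names become too long, could be sketched in Python 3 as follows. The helper name `hashed_name` and the choice of SHA-1 are assumptions made for illustration. A hash cannot be reversed, so this only fits the case where the original name is not needed or is recorded separately.

```python
import hashlib
import os


def hashed_name(name: str) -> str:
    """Return a short ASCII-only name derived from `name`."""
    _, ext = os.path.splitext(name)
    # Hash the UTF-8 bytes of the full name; the hex digest is 40 ASCII chars.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    try:
        # Keep the extension only when it is plain ASCII.
        ext.encode("ascii")
    except UnicodeEncodeError:
        ext = ""
    return digest + ext


if __name__ == "__main__":
    for name in ("你好.txt", "안녕하세요.html"):
        print(repr(name), "->", hashed_name(name))
```

If the original names must be restored later, one option is to store a mapping from hashed name to original name alongside the renamed files.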

score 3 · Answer 2 · answered Oct 03 '13 at 07:32

For all UTF-8 issues with Python, I warmly recommend spending 36 minutes watching "Pragmatic Unicode" by Ned Batchelder (http://nedbatchelder.com/text/unipain.html), presented at PyCon 2012. For me it was a revelation! Much of the presentation is in fact not Python-specific, but it helps with understanding important things like the difference between Unicode strings and UTF-8 encoded bytes...

The reason I'm recommending this video to you (as I have to many friends) is that some of your code contains contradictions, such as trying to decode and then encoding if decoding fails: these methods cannot both apply to the same object! Even though it is syntactically possible in Python 2, it makes no sense, and in Python 3 the distinction between bytes and str makes things clearer:

A str object can be encoded to bytes:

>>> a = 'a'
>>> type(a)
<class 'str'>
>>> a.encode
<built-in method encode of str object at 0x7f1f6b842c00>
>>> a.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

...while a bytes object can be decoded to str:

>>> b = b'b'
>>> type(b)
<class 'bytes'>
>>> b.decode
<built-in method decode of bytes object at 0x7f1f6b79ddc8>
>>> b.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

Coming back to your question of working with filenames, the tricky question you need to answer is: "What is the encoding of your filenames?" The language doesn't matter, only the encoding!
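
To make the bytes/str distinction concrete, here is a minimal Python 3 sketch of a validity check. It assumes you already hold the raw bytes of a name (for instance, `os.listdir` returns bytes names when it is given a bytes path); the helper name `is_valid_utf8` is hypothetical. Passing the check only means the bytes are well-formed UTF-8, not that UTF-8 was the intended encoding.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` is a well-formed UTF-8 byte sequence."""
    try:
        # Only bytes can be decoded; a str has no decode() in Python 3.
        raw.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = {
        "UTF-8 bytes": "你好.txt".encode("utf-8"),
        "cp1252 bytes": "chào.doc".encode("cp1252"),
        "ASCII bytes": b"hello.txt",
    }
    for label, raw in samples.items():
        print(label, raw, is_valid_utf8(raw))
```

Plain ASCII names pass the check too, because ASCII is a subset of UTF-8.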

Pierre H.

I am trying to decode to see if variable infile is in UTF8 or not. If decoding fails, it means that it was never encoded in UTF8 so it can be encoded now. – Phyton_user Oct 03 '13 at 09:11
Please note that the result of the decoding does not go to any variable. Do you still think there is a contradiction? – Phyton_user Oct 03 '13 at 09:14
OK, the problem is I don't quite understand what *converting to English* or *converting to UTF-8* really means. It would be nice if you could provide the `repr()` of two filenames: one that you want to convert, one that you don't. (I can't comment directly on your question) – Pierre H. Oct 03 '13 at 13:28
I have added examples of file names at the end of my question above. – Phyton_user Oct 03 '13 at 13:50
Good, but what is the `repr()` of these names when they're read by glob? Do you get a Unicode object or a str? And also, how would you like these names to be converted "in English"? – Pierre H. Oct 03 '13 at 14:41
PHP cannot even see these file names. After I convert them to UTF8 (using Python), PHP can finally see them. When PHP is able to access these files, then I can continue from this point in any way I choose. It is true that after conversion to UTF8 file names look like Gibberish, but I don't mind because I know that this Gibberish is UTF8 and I can do any conversion that I need within PHP. In PHP I get a string, not an object. – Phyton_user Oct 03 '13 at 16:20
I just ran across this valuable addition in Python 3.2 : os.fsdecode() to decode filenames http://docs.python.org/3.3/whatsnew/3.2.html#os – Pierre H. Oct 25 '13 at 14:12
