Python reading unicode file names

Question

I am new to Python. I am trying to use os.path.getsize() to obtain a file's size. However, if the file name is not in English but in Chinese, German, French, etc., Python cannot recognize it and does not return the size of the file. Could you please help me with this? How can I make Python recognize the file's name and return the size of these kinds of files?

For example: The file's name is: "Показатели естественного и миграционного прироста до 2030г.doc". path="C:\xxxx\xxx\xxxx\Показатели естественного и миграционного прироста до 2030г.doc"

I'd like to use os.path.getsize(path), but it does not recognize the file name. Could you please tell me what I should do?

Thank you very much!

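import codecs, csv, cStringIO

class UnicodeWriter:
    """Python 2 CSV writer that encodes rows of Unicode strings to the stream f."""

    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        # Rows are formatted into an in-memory queue, then re-encoded for f.
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        # Each cell is expected to be a unicode string.
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch the UTF-8 output from the queue and decode it to unicode ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... then re-encode it into the target encoding and write it out.
        data = self.encoder.encode(data)
        self.stream.write(data)
        # Empty the queue for the next row.
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)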

This seems to work fine in Python for me. However, I am running Linux, so this may be platform dependent (e.g. it may only affect Windows), and I apologize that I cannot help with that. I would recommend trying escape codes, e.g. "\xd0" for "П". – luke Jul 02 '13 at 23:14
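A small sketch of the escape-code idea, assuming UTF-8 as the byte encoding. "П" is the single code point U+041F, which UTF-8 encodes as the two bytes D0 9F, so one \x escape does not cover the whole letter. Spelling the name with \u escapes keeps the source file pure ASCII, so it does not depend on how the script itself is saved. The variable names are illustrative only.

```python
# "П" (CYRILLIC CAPITAL LETTER PE) written as a Unicode escape.
pe = u'\u041f'

# The same letter as UTF-8 bytes: two bytes, not one.
pe_utf8 = pe.encode('utf-8')

# First word of the file name from the question, spelled with escapes only.
word = u'\u041f\u043e\u043a\u0430\u0437\u0430\u0442\u0435\u043b\u0438'
```

Escapes are tedious for long names, so they are mostly useful for checking which characters a string really contains.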

Mark Tolonen · Answer 1 · 2013-07-03T15:45:03.490

2

Use a Unicode path and make sure to specify the encoding the source file is saved in:

#python2
#coding: utf8
import os
path = u'Показатели естественного и миграционного прироста до 2030г.doc'
with open(path,'w') as f:
    f.write('hello')
print os.path.getsize(path)

Result:

5

Check that the file was created correctly:

C:\>dir *.doc
 Volume in drive C has no label.
 Volume Serial Number is CE8B-D448

 Directory of C:\

07/02/2013  09:51 PM                 5 Показатели естественного и миграционного прироста до 2030г.doc
               1 File(s)              5 bytes
               0 Dir(s)  83,018,432,512 bytes free

Edit in response to comment

If you need to process a number of files, use os.listdir(u'path/to/files') (with a Unicode directory path) and that will read a directory and return the filenames in Unicode. If you need recursion, use os.walk(u'path/to/files').
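As a rough sketch of that approach, the generator below walks a directory tree and pairs each file with its size. The function name list_sizes is hypothetical, and u'path/to/files' is just the placeholder used above. On Python 2 the root should be a Unicode string so that the names come back as Unicode.

```python
import os

def list_sizes(root):
    """Yield (full_path, size_in_bytes) for every file under root.

    Pass root as a Unicode string so that os.walk returns Unicode names
    (this matters on Python 2; Python 3 strings are already Unicode).
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            yield full, os.path.getsize(full)

# Hypothetical usage with the placeholder directory from the answer:
# sizes = dict(list_sizes(u'path/to/files'))
```

A generator keeps memory use flat for large trees; wrap it in list() or dict() if all results are needed at once.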

edited Jul 03 '13 at 15:45

answered Jul 03 '13 at 04:54

Thank you very much! But what if I have lots of files to read in? – Ruxuan Ouyang Jul 03 '13 at 13:44
Thank you. But why does Python tell me "Unsupported characters in input" when I try path = u'Показатели естественного и миграционного прироста до 2030г.doc'? – Ruxuan Ouyang Jul 09 '13 at 12:51
Did you add the `#coding: utf8` comment? This tells Python what encoding the source file is saved in. You also obviously have to save the source file in that encoding. If you leave it out, Python 2.x assumes ASCII and you won't be able to use non-ASCII characters in the source. You can use other encodings, but UTF-8 supports all Unicode characters. – Mark Tolonen Jul 10 '13 at 00:28
I see. Many thanks! If I also want to write the original name into a CSV file, do I need to do anything else? May I use the unicode(), .encode() or .decode() functions? Basically, I input one path, use os.walk() to search all files, and want to write the files' names and sizes into a CSV file. Thank you very much! – Ruxuan Ouyang Jul 10 '13 at 13:12
.encode() will do it, or see the csv module and the example at the bottom of the Python csv docs. Ask another question if you have trouble, and accept an answer for this one if it was helpful. – Mark Tolonen Jul 10 '13 at 14:48
I use UnicodeWriter (please see my edit) to write them into the CSV file, but I get the error "'int' object has no attribute 'encode'". I think some parts of the name or path are numbers. I tried to change all of them to strings with str(), but then the Unicode names are not recognized again. Would you please help me with this? Thank you very much! – Ruxuan Ouyang Jul 10 '13 at 15:27
I tried .encode(). It does not seem to work. The previous approach works well except when I write the names into the CSV file. – Ruxuan Ouyang Jul 10 '13 at 16:10
Don't change your question. Ask a new one. – Mark Tolonen Jul 11 '13 at 01:27
Hi! I have asked a new one. The address is: http://stackoverflow.com/questions/18105873/python-reading-unicode-forlder-and-file-names – Ruxuan Ouyang Aug 07 '13 at 14:19
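
One possible way around the "'int' object has no attribute 'encode'" error from the comments, offered as a sketch rather than a confirmed fix: the UnicodeWriter in the question calls .encode() on every cell, and a file size is an int. Calling str() on a Unicode name in Python 2 tries an ASCII encode, which is likely why that attempt broke the names. Converting each cell to Unicode text first avoids both problems. The helpers to_text and encode_row are hypothetical names, not part of the csv module.

```python
import sys

# Text type: `unicode` on Python 2, `str` on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str

def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, whatever its original type."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Byte strings are assumed to be in `encoding`.
        return value.decode(encoding)
    return text_type(value)  # int, float, None, ...

def encode_row(row, encoding='utf-8'):
    """Encode every cell of a row to bytes, e.g. a (name, size) pair."""
    return [to_text(cell).encode(encoding) for cell in row]
```

On Python 2, the list comprehension in UnicodeWriter.writerow could then become self.writer.writerow(encode_row(row)). On Python 3 the csv module expects text rather than bytes, so only to_text would apply there.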
