Python not able to open file with non-english characters in path

Question

I have a file with the following path: D:/bar/クレイジー・ヒッツ！/foo.abc

I am parsing the path from an XML file and storing it in a variable called path, in the form file://localhost/D:/bar/クレイジー・ヒッツ！/foo.abc. The following operations are then applied to it:

import urllib

path = path.strip()
path = path[17:]  # remove the file://localhost/ part
path = urllib.url2pathname(path)
path = urllib.unquote(path)

The error is:

IOError: [Errno 2] No such file or directory: 'D:\\bar\\\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81\\foo.abc'

I am using Python 2.7 on Windows 7

Try using a unicode path string instead: `path = path.decode('utf8')` before the rest of your code. — Duncan, May 12 '11 at 07:27
@vr3690 Are you on Windows or not? Could you confirm this, please? — eyquem, May 12 '11 at 08:03
@Ignacio Vazquez-Abrams - how do I use a different encoding? What should I use here? — bcosynot, May 12 '11 at 08:06
Well, I just answered your question via Aardvark. What a coincidence. :D — winterTTr, May 12 '11 at 09:27
I had a similar issue on my Mac with French characters. The circumflex characters in the file name were not the correct French letters, so I got "file not found" even though the output showed the correct path. Something was lost in translation between the Mac and Python. I renamed the file with the correct circumflex characters and the problem went away. This helped me find the solution: https://stackoverflow.com/questions/19284130/python3-qt-unicode-file-name-problems — mmv_sat, Aug 05 '18 at 00:10

MattH · Accepted Answer · 2011-05-12T11:17:54.657

The path in your error is:

'\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'

I think this is the UTF-8-encoded version of your filename.

I created a folder with the same name on Windows 7 and placed a file called 'abc.txt' in it:

>>> import os
>>> a = '\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81'
>>> os.listdir('.')
['?????\xb7???!']
>>> os.listdir(u'.') # Pass unicode to have unicode returned to you
[u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01']
>>> 
>>> a.decode('utf8') # UTF8 decoding your string matches the listdir output
u'\u30af\u30ec\u30a4\u30b8\u30fc\u30fb\u30d2\u30c3\u30c4\uff01'
>>> os.listdir(a.decode('utf8'))
[u'abc.txt']

So it seems that Duncan's suggestion of path.decode('utf8') does the trick.
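
As a quick check of the reasoning above, here is a minimal sketch confirming that the bytes in the error message are the UTF-8 encoding of the folder name. It assumes Python 3 (the session above is Python 2.7), and the variable names are only illustrative.

```python
# Bytes copied from the IOError message in the question.
raw = (b'\xe3\x82\xaf\xe3\x83\xac\xe3\x82\xa4\xe3\x82\xb8\xe3\x83\xbc'
       b'\xe3\x83\xbb\xe3\x83\x92\xe3\x83\x83\xe3\x83\x84\xef\xbc\x81')

# Folder name as it appears in the question's path.
folder = 'クレイジー・ヒッツ！'

# Decoding the bytes as UTF-8 should give back the folder name,
# and encoding the folder name should give back the bytes.
decoded = raw.decode('utf-8')
print(decoded == folder)              # True
print(folder.encode('utf-8') == raw)  # True
```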

Update

I can't test this for you, but I suggest checking whether the path contains non-ASCII characters before calling .decode('utf8'). This is a bit hacky...

import urllib

# Identity table for printable ASCII (32..126); every other byte maps to '_'
ASCII_TRANS = '_'*32 + ''.join([chr(x) for x in range(32, 127)]) + '_'*129
path = path.strip()
path = path[17:]  # remove the file://localhost/ part
path = urllib.unquote(path)
if path.translate(ASCII_TRANS) != path:  # contains bytes outside printable ASCII
    path = path.decode('utf8')
path = urllib.url2pathname(path)
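
For comparison, here is a rough Python 3 sketch of the same conversion. It is not the code from this answer: it assumes Python 3, where urllib.parse.unquote decodes percent-escapes as UTF-8 by default and returns text, so no separate .decode step is needed. The function name file_url_to_windows_path is hypothetical, and PureWindowsPath is used only so the result looks the same on any operating system.

```python
from pathlib import PureWindowsPath
from urllib.parse import unquote, urlsplit


def file_url_to_windows_path(url):
    """Turn a file:// URL into a Windows-style path (hypothetical helper)."""
    parts = urlsplit(url.strip())
    if parts.scheme != 'file':
        raise ValueError('not a file URL: %r' % url)
    # unquote() decodes %XX escapes as UTF-8 unless told otherwise.
    decoded = unquote(parts.path)
    # The URL path looks like "/D:/bar/..."; drop the leading slash.
    return PureWindowsPath(decoded.lstrip('/'))


example = file_url_to_windows_path(
    'file://localhost/D:/bar/%E3%82%AF%E3%83%AC/foo.abc')
print(example)
```

Parsing the URL with urlsplit avoids the hard-coded path[17:] slice, which silently breaks if the prefix is ever something other than file://localhost/.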

It does. But, gives rise to another problem. I am finally using the following code to get myself a usable path: ` path=urllib.unquote(path) path=path.decode('utf8') path=urllib.url2pathname(path) ` Which gives rise to this error : `IOError: [Errno 2] No such file or directory: u'D:\\Music\\Pink Floyd\\The Wall Disc 1\\5 - Another Brick in the Wall, Pt. 2.mp3' ` Any idea what could be the problem with this path? — bcosynot, May 12 '11 at 10:39
@vr3690, might be easier to start a new question and go from there. — MattH, May 12 '11 at 13:14
Oops. It turns out I was dealing with some old data. That file did not actually exist; it had been deleted. But I really think your updated code is the way to go. Thanks a lot for your help! — bcosynot, May 12 '11 at 13:22

score 2 · Answer 2 · answered May 12 '11 at 10:11

Provide the filename as a unicode string to the open call.

How do you produce the filename?

if provided as a constant by you

Add a line near the beginning of your script:

# -*- coding: utf8 -*-

Then, in a UTF-8 capable editor, set path to the unicode filename:

path = u"D:/bar/クレイジー・ヒッツ！/foo.abc"

read from a list of directory contents

Retrieve the contents of the directory using a unicode dirspec:

dir_files = os.listdir(u'.')
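
A small sketch of the same rule in Python 3 (an assumption, since this answer targets Python 2): os.listdir returns names of the same type as its argument, so a text path yields text names and a bytes path yields bytes names. The temporary directory and the ASCII file name abc.txt are only for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create one empty file so the directory has something to list.
    open(os.path.join(folder, 'abc.txt'), 'w').close()

    text_names = os.listdir(folder)               # str in, str out
    byte_names = os.listdir(os.fsencode(folder))  # bytes in, bytes out

print(text_names)
print(byte_names)
```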

read from a text file

Open the file that contains the filename with codecs.open to read unicode data from it. You need to specify the encoding of that file (you presumably know the “default Windows charset” for non-Unicode applications on your computer).
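
A minimal sketch of reading filenames with an explicit encoding, written for Python 3. It uses io.open rather than codecs.open (a swapped-in but equivalent-purpose call that also accepts an encoding argument). The cp932 encoding, the temporary list file, and the helper name read_filenames are assumptions made for illustration; substitute the encoding your file actually uses.

```python
import io
import os
import tempfile


def read_filenames(list_path, encoding):
    """Return the non-empty lines of a text file as unicode strings."""
    with io.open(list_path, 'r', encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


expected = 'D:/bar/クレイジー・ヒッツ！/foo.abc'

# Write a sample list file in a legacy Japanese Windows code page.
fd, list_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(list_path, 'w', encoding='cp932') as handle:
        handle.write(expected + '\n')
    names = read_filenames(list_path, 'cp932')
finally:
    os.remove(list_path)

print(names[0] == expected)
```

Reading the same file with the wrong encoding would either raise UnicodeDecodeError or produce a different string, which then fails to match any real file name.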

in any case

Decode the path before opening the file, substituting the correct encoding if it is not "utf8":

path = path.decode("utf8")

answered May 12 '11 at 10:11

tzot

It does. But, gives rise to another problem. I am finally using the following code to get myself a usable path: ` path=urllib.unquote(path) path=path.decode('utf8') path=urllib.url2pathname(path) ` Which gives rise to this error : IOError: [Errno 2] No such file or directory: u'D:\\Music\\Pink Floyd\\The Wall Disc 1\\5 - Another Brick in the Wall, Pt. 2.mp3' Any idea what could be the problem with this path? – bcosynot May 12 '11 at 11:13
Anything for a fellow Floyd fan :) Right before the `open`, please `print(repr(path))` first to ensure that backslashes are as many as they should be. Post back here. – tzot May 12 '11 at 12:00
@ΤΖΩΤΖΙΟΥ ahh. Floyd. Here you go - `u'D:\\Music\\Pink Floyd\\The Wall Disc 1\\5 - Another Brick in the Wall, Pt. 2.mp3'` – bcosynot May 12 '11 at 12:50
Turns out, the file actually did not exist. I was dealing with some old data. Thanks a lot for your help, though! – bcosynot May 12 '11 at 13:23

score 1 · Answer 3 · answered May 12 '11 at 08:34

Here's some interesting stuff from the documentation:

sys.getfilesystemencoding()

Return the name of the encoding used to convert Unicode filenames into system file names, or None if the system default encoding is used. The result value depends on the operating system: On Mac OS X, the encoding is 'utf-8'. On Unix, the encoding is the user’s preference according to the result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET) failed. On Windows NT+, file names are Unicode natively, so no conversion is performed. getfilesystemencoding() still returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names. On Windows 9x, the encoding is 'mbcs'.

New in version 2.3.

If I understand this correctly, you should pass the file name as unicode:

f = open(unicode(path, encoding))
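
The call above only works when path is a byte string; on a value that is already unicode it raises TypeError, as the comments under this answer show. A minimal Python 3 sketch of a helper that decodes only when needed follows. The name to_text_path is hypothetical, and falling back to sys.getfilesystemencoding() is an assumption about what the caller wants.

```python
import sys


def to_text_path(path, encoding=None):
    """Return path as text, decoding it only if it is a byte string."""
    if isinstance(path, bytes):
        # Fall back to the filesystem encoding when none is given.
        return path.decode(encoding or sys.getfilesystemencoding())
    return path  # already text: leave it untouched


print(to_text_path(b'\xe3\x82\xaf', 'utf-8') == 'ク')
print(to_text_path('D:/bar/foo.abc'))
```

In Python 3 the standard library's os.fsdecode does a similar job for the filesystem encoding.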

answered May 12 '11 at 08:34

codeape

Ok, I did this : `path=unicode(path,sys.getfilesystemencoding())` got this error - `TypeError: decoding Unicode is not supported` – bcosynot May 12 '11 at 09:05
I think the path should already be a unicode, so you should try path.encode( encoding ) – winterTTr May 12 '11 at 09:28
OK, so it seems that path already is a unicode string. In that case, I would try to encode the path as mbcs: path = path.encode(sys.getfilesystemencoding()); open(path).read(). – codeape May 12 '11 at 09:28
I played around with the encoding a bit. Although that solved my initial problem, I now have another problem at hand. Take a look at my comment on the other answer by MattH – bcosynot May 12 '11 at 10:41
