listdir doesn't print non-English letters correctly

Question

On Python 2.7,

import os
for dir in os.listdir("E:/Library/Documents/Old - Archives/Case"):
    print dir

prints out:

Danny.xlsx
Dannyh.xlsx
~$??? ?? ?????? ??? ???? ???????.docx

while this:

# using a unicode literal
for dir in os.listdir(u"E:/Library/Documents/Old - Archives/Case"):
   print dir

prints out:

Dan.xlsx
Dann.xlsx

Traceback (most recent call last):
  File "E:\...\FirstModule.py", line 31, in <module>
    print dir
  File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 2-4: character maps to <undefined>

The file's name is in Hebrew, as such: המסמך.xls

How can I make it appear in Hebrew in Python too?

More oddness: `s = os.listdir(u"E:/Library/Documents/Old - Archives/Case")[2]; print s` works just fine. — mirandalol, Mar 31 '12 at 10:42
Solved it: `# -*- coding: utf-8 -*-` at the top of the document solved it. — mirandalol, Mar 31 '12 at 10:45
When you solve your own problem, you should post the answer as an answer not a comment and accept it. — agf, Mar 31 '12 at 11:02
@Saga That makes no sense. How in the world does declaring the source encoding have any effect on the I/O? — tchrist, Mar 31 '12 at 15:24

score 6 · Answer 1 · answered Mar 31 '12 at 14:34

The version with u'' string literal works fine: ask with a Unicode pathname and you'll get a Unicode pathname in response, allowing you to work with filenames that include characters outside the current code page.
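A minimal sketch of that behaviour, written to run on Python 3 and intended to behave the same way on 2.7. The helper name `names_by_path_type` and the placeholder file `report.xls` are made up for illustration; the demo uses a temporary directory rather than the path from the question.

```python
import os
import shutil
import sys
import tempfile


def names_by_path_type(text_dir):
    """List text_dir twice: once with a text path, once with a byte path."""
    fs_encoding = sys.getfilesystemencoding() or "utf-8"
    # Text (unicode) path in, text names out.
    text_names = sorted(os.listdir(text_dir))
    # Byte path in, byte names out.
    byte_names = sorted(os.listdir(text_dir.encode(fs_encoding)))
    return text_names, byte_names


demo_dir = tempfile.mkdtemp()
if isinstance(demo_dir, bytes):  # Python 2 returns a byte string here
    demo_dir = demo_dir.decode(sys.getfilesystemencoding())
try:
    # Hypothetical placeholder file, ASCII so it is valid on any file system.
    open(os.path.join(demo_dir, u"report.xls"), "w").close()
    text_names, byte_names = names_by_path_type(demo_dir)
finally:
    shutil.rmtree(demo_dir)

print(repr(text_names))
print(repr(byte_names))
```

The type of the argument decides the type of the results, which is why only the `u"..."` call can return names outside the current code page intact.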

Your problem comes solely from trying to print the filename. Getting Unicode output to the Windows Command Prompt is a trial.

The default C standard library print function is limited to the locale code page. Unless you call the Win32 API function WriteConsoleW directly (using ctypes) you're never going to get reliable console Unicode support; and even then it won't work unless a suitable non-default font is chosen. This affects pretty much all non-native command line tools, not just Python.
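If the goal is only to avoid the traceback rather than to display real Hebrew, one possible workaround is to escape whatever the console encoding cannot represent before printing. This is a sketch, not a fix for the console itself; `console_safe` is a hypothetical helper name, and `cp1252` is passed explicitly to mirror the code page in the question's error message.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with unencodable characters replaced by backslash escapes."""
    # Fall back to ASCII when the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Round-trip through the target encoding; anything it cannot hold
    # becomes an escape sequence made of plain ASCII characters.
    return text.encode(encoding, "backslashreplace").decode(encoding)


hebrew_name = u"\u05d4\u05de\u05e1\u05de\u05da.xls"
print(console_safe(hebrew_name, "cp1252"))
print(console_safe(u"Danny.xlsx", "cp1252"))
```

Names that fit the code page pass through unchanged; the Hebrew name is printed as escape sequences instead of raising `UnicodeEncodeError`.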

answered Mar 31 '12 at 14:34

bobince


This is what I was looking for! I read the folder name and didn't know which encoding Python was giving me. I had to guess, decoding with several codec names, before I could get Unicode code point values out of it. This really solves the issue. – off99555 Jun 08 '16 at 20:21
In fact, Python didn't infer any encoding for me. It just gave me bytes in the form of hexadecimal values and let me find the encoding of those filenames myself. – off99555 Jun 08 '16 at 20:32

score 2 · Accepted Answer · edited May 07 '12 at 17:38

Solved it: adding `# -*- coding: utf-8 -*-` at the top of the source file fixed it.

edited May 07 '12 at 17:38

Flexo


answered May 07 '12 at 11:58

mirandalol


This can't solve the problem as described. Something else had to have changed at the same time. That comment declares the source encoding only, and only affects source files with non-ASCII on Python 2. The example is only ASCII so this would have zero effect. More likely the OP also changed to a Unicode string in `listdir` at the same time. – Mark Tolonen Sep 03 '17 at 15:28

score 1 · Answer 3 · answered Mar 31 '12 at 17:10

The problem is that your output console uses the cp1252 encoding, per your error message, and Hebrew cannot be printed under that encoding. Use an IDE that supports UTF-8, with a font that supports Hebrew, and it will work correctly when using os.listdir with a Unicode path.

Here's an example from the PythonWin IDE with and without a Unicode path.

PythonWin 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import os
>>> for f in os.listdir('.'):
...     print f
...     
x.exe
x.py
x.pyc
y.py
?????.xls
>>> for f in os.listdir(u'.'):
...     print f
...     
x.exe
x.py
x.pyc
y.py
המסמך.xls
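The cp1252 limitation can be checked without any console at all. A small sketch, assuming the file name from the question; `can_encode` is a hypothetical helper, and cp1255 (the Windows Hebrew code page) and UTF-8 are included only for comparison.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The Hebrew file name from the question, written with escapes so this
# source file stays pure ASCII.
hebrew_name = u"\u05d4\u05de\u05e1\u05de\u05da.xls"

for codec in ("cp1252", "cp1255", "utf-8"):
    print("%s: %s" % (codec, can_encode(hebrew_name, codec)))
```

Only cp1252 reports `False` here, which matches the `UnicodeEncodeError` raised from `cp1252.py` in the question's traceback.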

Also note that an encoding declaration in your source file does nothing for generating output. It only declares what encoding the source file is saved in, which affects the ability to write non-ASCII characters in the source file.
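To illustrate what the declaration does affect, the sketch below compiles the same source bytes under two different declarations and compares the resulting string literal. It is written for Python 3 and intended to behave the same on 2.7; `literal_under_declaration` is a hypothetical helper name, and the single byte 0xE4 is chosen because it means different characters in latin-1 and cp1255.

```python
def literal_under_declaration(encoding_name):
    """Compile a tiny module whose literal holds the raw byte 0xE4."""
    source = (
        b"# -*- coding: " + encoding_name.encode("ascii") + b" -*-\n"
        b"s = u'\xe4'\n"
    )
    namespace = {}
    # compile() reads the declaration to decode the bytes of the source text.
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


print(repr(literal_under_declaration("latin-1")))
print(repr(literal_under_declaration("cp1255")))
```

The same byte becomes U+00E4 under latin-1 and the Hebrew letter U+05D4 under cp1255. The declaration changes how literals in the source are decoded and nothing about how `print` encodes output.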
