Two questions: 1. Why does

In [21]:                                                                                   
   ....:     for root, dir, file in os.walk(spath):
   ....:         print(root)

print the whole tree but

In [6]: for dirs in os.walk(spath):                             
...:     print(dirs)    

chokes on this unicode error?

UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 1477: character maps to <undefined>

[NOTE: this is the TM symbol]

  2. I looked at these answers

Scraping works well until I get this error: 'ascii' codec can't encode character u'\u2122' in position

What's the deal with Python 3.4, Unicode, different languages and Windows?

python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to <undefined>

https://github.com/Drekin/win-unicode-console

https://docs.python.org/3/search.html?q=IncrementalDecoder&check_keywords=yes&area=default

and tried these variations

----> 1 print(dirs, encoding='utf-8')                                                           
TypeError: 'encoding' is an invalid keyword argument for this function       
In [11]: >>> u'\u2122'.encode('ascii', 'ignore')                                                
Out[11]: b''                       

print(dirs).encode(‘utf=8’)

all to no effect.

This was done with Python 3.4.3 and Visual Studio Code 1.6.1 on Windows 10. The default settings in Visual Studio Code include:

// The default character set encoding to use when reading and writing files.
"files.encoding": "utf8",

Python 3.4.3, Visual Studio Code 1.6.1, IPython 3.0.0

UPDATE: I tried this again in the Sublime Text REPL, running a script. Here's what I got:

# -*- coding: utf-8 -*-
import os

spath = 'C:/Users/Semantic/Documents/Align' 

with open('os_walk4_align.txt', 'w') as f:
    for path, dirs, filenames in os.walk(spath):
        print(path, dirs, filenames, file=f)

Traceback (most recent call last):
  File "listdir_test1.py", line 8, in <module>
    print(path, dirs, filenames, file=f)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2605' in position 300: character maps to <undefined>

This code is only 217 characters long, so where does ‘position 300’ come from?

Malik A. Rumi
  • I assume you mean 'unicode', not 'unicorn'. I am testing out the new Visual Studio Code on Windows 10, that is why I am using it, and as I said, the default is already set to utf-8. Furthermore, I tried this in Sublime Text, and I am STILL getting unicode errors, albeit different ones. – Malik A. Rumi Oct 17 '16 at 21:29
  • Setting the source encoding (`#coding:utf8`) has nothing to do with the output encoding. As you can see from your error, `cp1252` is the output encoding, and it doesn't support the characters being printed to the terminal. The easiest way around this is to write to a file with UTF-8 encoding instead of printing to a display, or to use a Python IDE that supports UTF-8 output. I'm not familiar with Sublime Text, but it probably has a way to adjust the output encoding as well. – Mark Tolonen Oct 17 '16 at 22:13

2 Answers

3

Here's a test case:

C:\TEST
├───dir1
│       file1™
│
└───dir2
        file2

Here's a script (Python 3.x):

import os

spath = r'c:\test'

for root,dirs,files in os.walk(spath):
    print(root)

for dirs in os.walk(spath):                             
    print(dirs)

Here's the output, on an IDE that supports UTF-8 (PythonWin, in this case):

c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
('c:\\test\\dir1', [], ['file1™'])
('c:\\test\\dir2', [], ['file2'])

Here's the output, on my Windows console, which defaults to cp437:

c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
Traceback (most recent call last):
  File "C:\test.py", line 9, in <module>
    print(dirs)
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 47: character maps to <undefined>

For Question 1, print(root) works because no directory name contains a character that the output encoding doesn't support. In the second loop, however, dirs is the whole (root, dirs, files) tuple that os.walk yields, so print(dirs) also prints the file names, and one of them contains a character the Windows console can't encode.
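
A minimal sketch of that difference, using hard-coded strings that mirror the test tree above instead of a real directory walk. The helper name can_encode is hypothetical, and cp437 is assumed as the console encoding:

```python
def can_encode(text, encoding="cp437"):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# What print(root) sends to the console: directory names only.
roots = ["c:\\test", "c:\\test\\dir1", "c:\\test\\dir2"]

# What print(dirs) sends: str() of each whole (root, dirs, files) tuple.
entries = [
    ("c:\\test", ["dir1", "dir2"], []),
    ("c:\\test\\dir1", [], ["file1\u2122"]),  # \u2122 is the trademark sign
    ("c:\\test\\dir2", [], ["file2"]),
]

root_results = [can_encode(r) for r in roots]
entry_results = [can_encode(str(e)) for e in entries]

print(root_results)   # every root is encodable
print(entry_results)  # the dir1 entry is not, because of the file name
```

Only the entry that carries the file name with the trademark sign fails, which matches where the traceback above interrupts the output.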

For Question 2, the first example misspelled utf-8 as utf=8, and the second example didn't declare an encoding for the file the output was written to, so it used a default that didn't support the character.
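
To see which defaults are in play on a given machine, something like the following may help. It uses only the standard library; the values printed depend on the system and the Python version, so none are shown here:

```python
import locale
import sys

# Encoding that print() uses for the console or IDE output pane.
# May be None if sys.stdout was replaced by an object without one.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding that open() falls back to when no encoding= argument is given.
file_default = locale.getpreferredencoding(False)

print("stdout encoding:", stdout_encoding)
print("open() default: ", file_default)
```

If either value is a legacy code page such as cp437 or cp1252, characters outside that code page will raise UnicodeEncodeError unless an encoding is passed explicitly.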

Try this:

import os

spath = r'c:\test'

with open('os_walk4_align.txt', 'w', encoding='utf8') as f:
    for path, dirs, filenames in os.walk(spath):
        print(path, dirs, filenames, file=f)

Content of os_walk4_align.txt, encoded in UTF-8:

c:\test ['dir1', 'dir2'] []
c:\test\dir1 [] ['file1™']
c:\test\dir2 [] ['file2']
Mark Tolonen
  • Ding! Ding! Ding! We have a winner! My own '=' typo aside, I had the encoding on the print line, when it should have been in the arguments. Your detailed answer helped a great deal. Thanks. – Malik A. Rumi Oct 18 '16 at 04:40
-1

The console you're outputting to doesn't support non-ASCII by default. You need to use str.encode('utf-8').

That works on strings, not on lists, so print(dirs).encode(‘utf=8’) won't work. Also, the codec name is utf-8, not utf=8.

Print your lists with a list comprehension, like:

>>> print([s.encode('utf-8') for s in ['a', 'b']])
[b'a', b'b']
>>> print([d.encode('utf-8') for d in dirs])  # to print `dirs`
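
Note that on Python 3, printing encoded elements shows bytes literals such as b'file1\xe2\x84\xa2' rather than the original text. One possible alternative, offered as a sketch and not as part of the answer above, is to escape only the characters the console cannot encode. The helper name console_safe is hypothetical, and cp437 is an assumed fallback for when sys.stdout reports no encoding:

```python
import sys

def console_safe(obj, encoding=None):
    """Return str(obj) with characters the target encoding lacks escaped."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "cp437"
    text = str(obj)
    # backslashreplace turns an unencodable character into a \uXXXX escape
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# Same shape as one item yielded by os.walk; cp437 is forced for the demo.
entry = ("c:\\test\\dir1", [], ["file1\u2122"])
safe = console_safe(entry, encoding="cp437")
print(safe)
```

With this approach the output stays readable and the loop keeps running; the unsupported character appears as an escape sequence instead of raising UnicodeEncodeError.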
aneroid