2

I'm working on a small Python 3 project that scans a drive full of files and writes the path of every file on it to a .txt file. The problem is that some of the file names are in Brazilian Portuguese, which has accented letters such as those in "não" and "você", and those letters come out wrong in the final .txt.

The code is just these few lines below:

import glob

path = r'path/path'

files = [f for f in glob.glob(path + "**/**", recursive=True)]

with open("file.txt", 'w') as output:
    for row in files:
        output.write(str(row.encode('utf-8') )+ '\n')

An example of the output:

path\folder1\Treino_2.doc
path\folder1\Treino_1.doc
path\folder1\\xc3\x81gua de Produ\xc3\xa7\xc3\xa3o.doc

The last line shows how some of the outputs are wrong, since \xc3\x81gua de Produ\xc3\xa7\xc3\xa3o should be Água de Produção.

Konrad Rudolph
  • Opening a file as text is *enough*. You probably want to pick an encoding for the file, with the `encoding` argument to `open()`. You don't want to encode manually. – Martijn Pieters Aug 15 '20 at 15:45
  • `str(row.encode('utf-8'))` encodes the value to bytes, then converts the bytes object to a string, and you **don't want to do that**. – Martijn Pieters Aug 15 '20 at 15:46
  • When I don't do that, I sometimes get the error `UnicodeEncodeError: 'charmap' codec can't encode` – Pedro Enrique Andrade Aug 15 '20 at 15:52
  • That's an issue with your default file encoding. – Martijn Pieters Aug 15 '20 at 15:55
  • For future reference, **include that information in your question**. Also, the output of your code isn't accurately reproduced: you are also writing `b'...'` in those files, so a lowercase letter `b` and quotes. Those are the result of using `str()` on the bytes objects. – Martijn Pieters Aug 15 '20 at 15:58
  • `str(row.encode('utf-8'))` should produce the string representation of a `bytes` object, like `b'path\folder1\Treino_2.doc'`. Your output file shouldn't contain just the filenames. – tdelaney Aug 15 '20 at 15:59
  • Which line gets the 'charmap' error? You could `print(sys.getdefaultencoding(), sys.getfilesystemencoding())` to get more information about your system. The environment matters for reading the file in other tools. Suppose you are on Windows with a codepage and you try to open a UTF-8 encoded document in Notepad or something similar. It will interpret the extended UTF-8 characters as codepage characters and display the wrong thing. – tdelaney Aug 15 '20 at 16:04
  • @tdelaney: when they don't use `str(....encode("utf8"))`. Meaning that the default encoding used when opening a file is an ANSI codepage, not UTF8. To see what codec that is, you want to look at [`locale.getpreferredencoding()`](https://docs.python.org/3/library/locale.html#locale.getpreferredencoding) – Martijn Pieters Aug 15 '20 at 16:07
  • @MartijnPieters - The default encoding is whatever `sys.getdefaultencoding()` says. OP's code doesn't write the output claimed, and if it's changed to write UTF-8 properly, it still might look incorrect when opened in other tools that use a different encoding. This is a common problem on Windows, where writing UTF-16 with a BOM may be a good choice. – tdelaney Aug 15 '20 at 16:13
  • @MartijnPieters - `locale.getpreferredencoding()` is interesting. I don't know how it differs from `sys.getdefaultencoding()`. – tdelaney Aug 15 '20 at 16:14
  • @tdelaney: no, sorry, you got that wrong. `sys.getdefaultencoding()` is [**hardwired to UTF8 in Python 3**](https://github.com/python/cpython/blob/0a5b30d98913e84f80ecea2b861e96d8f67c89e9/Objects/unicodeobject.c#L4104-L4108). It's there for compatibility with Python 2, mostly. It has nothing to do with the default encoding for *opened files*. – Martijn Pieters Aug 15 '20 at 16:16
  • @tdelaney: This is detailed in the [`open()` function documentation](https://docs.python.org/3/library/functions.html#open): *In text mode, if encoding is not specified the encoding used is platform dependent: `locale.getpreferredencoding(False)` is called to get the current locale encoding* – Martijn Pieters Aug 15 '20 at 16:18
  • @MartijnPieters - I just dug it up in the source and it is indeed hard coded to UTF-8. The help for `sys.getdefaultencoding` says _Return the current default string encoding used by the Unicode implementation._ which I always assumed was the same default encoding referred to in the `open` documentation. – tdelaney Aug 15 '20 at 16:32
  • @tdelaney: `sys.getdefaultencoding()` is the encoding used when you use `strvalue.encode()` with no argument. In Python 2, which had *implicit* string to bytes encoding and vice versa, this used to be configurable, then saw the ability to set the codec disabled during the Python start-up process (and people re-enabling it with hacks), and a [bit of a controversial issue](https://stackoverflow.com/q/3828723). Files (streams) are part of the locale, and so their default encoding is governed by the locale settings. This applies to `stdin`, `stdout` and `stderr` too. – Martijn Pieters Aug 15 '20 at 17:06

2 Answers

5

Python files handle Unicode text (including Brazilian accented characters) directly. All you need to do is use the file in text mode, which is the default unless you explicitly ask open() for a binary file. "w" gives you a writable text file.

You may want to be explicit about the encoding, however, by using the encoding argument for the open() function:

with open("file.txt", "w", encoding="utf-8") as output:
    for row in files:
        output.write(row + "\n")

If you don't set the encoding explicitly, a system-specific default is selected, and not every encoding can represent every Unicode code point. This bites on Windows more often than on other operating systems, because the default ANSI codepage there leads to errors such as "'charmap' codec can't encode character", but it can happen on any operating system if the current locale is configured to use a non-Unicode encoding.
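
As a minimal sketch of the point above, the snippet below prints the encoding that open() falls back to and then compares a legacy codepage with UTF-8. The cp1252 codec, the helper name can_encode and the sample characters are illustrative assumptions, not taken from the question; the asker's actual codepage is unknown.

```python
import locale

# Encoding that open() uses in text mode when no encoding= is given.
print(locale.getpreferredencoding(False))


def can_encode(text, codec):
    """Return True if every character in text is representable in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# cp1252 (a common Windows ANSI codepage) covers Portuguese accents...
print(can_encode("Água de Produção", "cp1252"))
# ...but not every Unicode character, for example the Polish letter U+0142.
print(can_encode("\u0142", "cp1252"))
# UTF-8 can encode both.
print(can_encode("Água de Produção \u0142", "utf-8"))
```

A file opened with a codepage like this works until the first name containing an unsupported character shows up, which would explain an error that appears only "sometimes".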

Do not encode to bytes and then convert the resulting bytes object back to a string with str(). That only makes a mess: the result is the string representation of the bytes object, complete with escape sequences and the b prefix:

>>> v = r"path\folder1\Água de Produção.doc"
>>> v.encode("utf8")  # bytes are represented with the "b'...'" syntax
b'path\\folder1\\\xc3\x81gua de Produ\xc3\xa7\xc3\xa3o.doc'
>>> str(v.encode("utf8"))  # converting back with `str()` includes that syntax
"b'path\\\\folder1\\\\\\xc3\\x81gua de Produ\\xc3\\xa7\\xc3\\xa3o.doc'"

See "What does a b prefix before a python string mean?" for more details as to what happens here.

Martijn Pieters
3

You probably just want to write the filename strings directly to the file, without encoding them to UTF-8 yourself: they are already text (str), and the file object encodes them when writing. That is:

…
    for row in files:
        output.write(row + '\n')

Should do the right thing.
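
For completeness, here is a hedged sketch of what the whole script could look like with that change. The function name write_file_list and its parameters are hypothetical, and the explicit encoding="utf-8" is an assumption about the desired output format rather than something the question specifies.

```python
import glob
import os


def write_file_list(root, out_path):
    """Write every path found under root to out_path, one per line, as UTF-8."""
    # "**" with recursive=True matches files and directories at any depth
    # (root itself is included in the results).
    pattern = os.path.join(glob.escape(root), "**")
    paths = glob.glob(pattern, recursive=True)
    # Text mode with an explicit encoding: write the str objects directly.
    with open(out_path, "w", encoding="utf-8") as output:
        for path in paths:
            output.write(path + "\n")
    return paths
```

Calling write_file_list(r'path/path', 'file.txt') would roughly mirror the script in the question.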


I say “probably” since filenames do not have to be valid UTF-8 in some operating systems (e.g. Linux!), and treating those as UTF-8 will fail. In that case your only recourse is to handle the filenames as raw byte sequences — however, this won’t ever happen in your code, since glob already returns strings rather than byte arrays, i.e. Python has already attempted to decode the byte sequences representing the filenames as UTF-8.

You can tell glob to handle arbitrary byte filenames (i.e. non-UTF-8) by passing the globbing pattern as a byte sequence. On Linux, the following works:

filename = b'\xbd\xb2=\xbc \xe2\x8c\x98'

with open(filename, 'w') as file:
    file.write('hi!\n')

import glob
print(glob.glob(b'*')[0])
# b'\xbd\xb2=\xbc \xe2\x8c\x98'

# BUT:
print(glob.glob('*')[0])
#---------------------------------------------------------------------------
#UnicodeEncodeError                        Traceback (most recent call last)
#<ipython-input-12-2bce790f5243> in <module>
#----> 1 print(glob.glob('*')[0])
#
#UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
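
The UnicodeEncodeError above arises because undecodable filename bytes are turned into lone surrogate code points (the surrogateescape error handler), and strict UTF-8 refuses to encode those. The sketch below reproduces that round trip with explicit decode/encode calls so that it behaves the same on any platform; the byte string is the filename from the example above, and the variable names are only illustrative.

```python
raw = b'\xbd\xb2=\xbc \xe2\x8c\x98'

# Invalid UTF-8 bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
name = raw.decode("utf-8", "surrogateescape")

# Strict UTF-8 refuses to encode lone surrogates.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The same handler reverses the mapping and restores the original bytes.
restored = name.encode("utf-8", "surrogateescape")

# For a text report, backslashreplace keeps the name writable and readable.
printable = name.encode("utf-8", "backslashreplace").decode("utf-8")
print(printable)
```

Passing errors="backslashreplace" to open() applies the same replacement to everything written to a text file, which is one possible way to keep a file listing from failing on such names.
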
Konrad Rudolph
  • Your advice for `glob()` is a red herring. They do **not** have an issue with `glob()`. – Martijn Pieters Aug 15 '20 at 16:05
  • @MartijnPieters I’m aware. The specific reply to OP is the first part of my answer. The second part of my answer is a general FYI that easily 99% of all Python code gets wrong (and the same in many other languages, to be fair). – Konrad Rudolph Aug 15 '20 at 16:06
  • Given that the user is on Windows, and presumably using [a recent Python release](https://docs.python.org/3/whatsnew/3.6.html#pep-529-change-windows-filesystem-encoding-to-utf-8), this is not likely an issue for them. – Martijn Pieters Aug 15 '20 at 16:13