read a file and try to remove all non UTF-8 chars

Question

I am trying to read a file and convert its contents to a UTF-8 string, in order to remove some non-UTF-8 characters from it:

file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')

but I got the following error,

AttributeError: 'str' object has no attribute 'decode'

Update: I tried the code as suggested by the answer,

file_str = open(file_path, 'r', encoding='utf-8').read()

but it didn't eliminate the non-UTF-8 characters. How can I remove them?

Are you using Python 3? If so, all strings are already Unicode objects and you don't need to decode. – Sid, Apr 05 '16 at 14:37
You are using Python 3; `open()` returned a file object that *already decoded to Unicode* for you. Python 3 `str` is the Unicode type, it has no `decode()` method because you can't decode Unicode any further. – Martijn Pieters, Apr 05 '16 at 14:38
Possible duplicate of ['str' object has no attribute 'decode' in Python3](http://stackoverflow.com/questions/26125141/str-object-has-no-attribute-decode-in-python3) – Reti43, Apr 05 '16 at 14:41
For coding questions, please mention the Python version or tag the question accordingly. – Iron Fist, Apr 05 '16 at 17:45

score 4 · Accepted Answer · answered Apr 05 '16 at 14:40

Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 the open() call with text mode (the default) returned a file object that decodes the data to Unicode strings for you.
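As a rough illustration of the point above, the sketch below reads the same file once in text mode and once in binary mode. The temporary file and its contents are made up for the example; only text mode hands back an already-decoded `str`, while binary mode returns `bytes`, which is the type that actually has a `decode()` method.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"hello")
    file_path = tmp.name

# Text mode: Python decodes for you and returns str.
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: you get the raw bytes and decode them yourself.
with open(file_path, "rb") as f:
    data = f.read()

os.remove(file_path)

text_has_decode = hasattr(text, "decode")  # str has no decode() in Python 3
decoded = data.decode("utf-8")             # bytes does

print(type(text).__name__, type(data).__name__, text_has_decode)
```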

You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:

file_str = open(file_path, 'r', encoding='utf8').read()

For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have Mojibake, because the UTF-8 data was decoded using CP1252 or a similar 8-bit codec.

See the open() function documentation for further details.
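To make the Mojibake problem concrete, here is a small sketch that writes a UTF-8 file and reads it back twice: once with the correct codec and once with CP1252, standing in for a mismatched system default. The sample text and temporary file are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical sample data: "café" stored on disk as UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("café".encode("utf-8"))
    file_path = tmp.name

# Explicit, correct codec: the text round-trips.
with open(file_path, "r", encoding="utf-8") as f:
    good = f.read()

# Wrong 8-bit codec: no exception is raised, but the accented
# letter comes back as two unrelated characters (Mojibake).
with open(file_path, "r", encoding="cp1252") as f:
    bad = f.read()

os.remove(file_path)

print(good)  # café
print(bad)   # cafÃ©
```

Note that the second read fails silently, which is why passing `encoding` explicitly is usually the safer habit.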

Martijn Pieters

This merely reads a UTF-8 encoded file. The OP asks for a way to ignore non-UTF-8 characters while reading a file that contains some dirty characters. This answer causes a `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 80189: invalid continuation byte` when a dirty character is read. – mareoraft Oct 12 '20 at 01:19
@mareoraft: The OP never properly specified what they meant by *non-UTF8 characters*, which they wanted to remove after decoding. For files with **some** invalid data, you could use a different error handler, such as `errors='ignore'` or `errors='replace'` or `errors='surrogateescape'`. See https://docs.python.org/3/library/codecs.html#error-handlers – Martijn Pieters Oct 12 '20 at 14:03
@mareoraft: do make 100% certain that the file is *meant* to be decodable as UTF-8. UTF-8 is resistant to some corruption: ignoring or replacing invalid bytes will not make the correct data that follows unreadable. – Martijn Pieters Oct 12 '20 at 14:05

score 2 · Answer 2 · answered Oct 12 '20 at 01:33

If you use

file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()

then any bytes that cannot be decoded as UTF-8 are silently dropped instead of raising an error. Read the open() function documentation for details; it has a section on the possible values for the errors parameter.
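The following sketch compares a few of the standard error handlers on a file containing one invalid byte. The sample bytes and the helper name `read_with` are made up for the example; `'ignore'`, `'replace'` and `'surrogateescape'` are handlers listed in the codecs documentation.

```python
import os
import tempfile

# Hypothetical dirty data: valid UTF-8 for "café", a stray 0xC9 byte, then "!".
dirty = b"caf\xc3\xa9\xc9!"

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(dirty)
    file_path = tmp.name

def read_with(errors):
    """Read the sample file as UTF-8 using the given error handler."""
    with open(file_path, "r", encoding="utf-8", errors=errors) as f:
        return f.read()

# The default handler ('strict') raises on the invalid byte.
try:
    read_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = read_with("ignore")            # invalid byte dropped
replaced = read_with("replace")          # invalid byte becomes U+FFFD
escaped = read_with("surrogateescape")   # invalid byte kept as a lone surrogate

# surrogateescape is reversible: encoding restores the original bytes.
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

os.remove(file_path)

print(strict_failed, ignored, replaced, roundtrip == dirty)
```

Which handler fits depends on the goal: `'ignore'` discards data without a trace, `'replace'` leaves a visible marker, and `'surrogateescape'` preserves the bad bytes so they can be written back unchanged.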
