1

I am trying to read a file and convert its contents to a UTF-8 string, in order to remove some non-UTF-8 characters from the file string:

file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')

but I got the following error,

AttributeError: 'str' object has no attribute 'decode'

Update: I tried the code as suggested by the answer,

file_str = open(file_path, 'r', encoding='utf-8').read()

but it didn't eliminate the non-UTF-8 characters, so how do I remove them?

Iron Fist
daiyue
  • are you using python 3? In which case all strings are already unicode objects. You don't need to decode. – Sid Apr 05 '16 at 14:37
  • 3
    You are using Python 3; `open()` returned a file object that *already decoded to Unicode* for you. Python 3 `str` is the Unicode type, it has no `decode()` method because you can't decode Unicode any further. – Martijn Pieters Apr 05 '16 at 14:38
  • Possible duplicate of ['str' object has no attribute 'decode' in Python3](http://stackoverflow.com/questions/26125141/str-object-has-no-attribute-decode-in-python3) – Reti43 Apr 05 '16 at 14:41
  • For coding matters...Please do mention Python version or tag it accordingly .. – Iron Fist Apr 05 '16 at 17:45

2 Answers

4

Remove the .decode('utf-8') call. Your file data has already been decoded, because in Python 3 the open() call in text mode (the default) returns a file object that decodes the data to Unicode strings for you.

You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:

file_str = open(file_path, 'r', encoding='utf8').read()

For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have Mojibake, as the UTF-8 data is decoded using CP1252 or a similar 8-bit codec.

See the open() function documentation for further details.
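As a rough illustration of the Mojibake problem described above, the sketch below writes a small UTF-8 file and reads it back twice: once with the correct codec and once with CP1252. The sample text and the temporary file are assumptions made for the demonstration; they are not from the question.

```python
import os
import tempfile

# Hypothetical sample data: one non-ASCII character, stored as UTF-8 bytes.
text = "café"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    file_path = tmp.name

try:
    # Explicit encoding: decoded as UTF-8 regardless of the platform default.
    with open(file_path, "r", encoding="utf-8") as f:
        decoded_ok = f.read()      # "café"

    # Wrong codec: each byte of the two-byte sequence for "é" becomes
    # its own CP1252 character, which is the Mojibake effect.
    with open(file_path, "r", encoding="cp1252") as f:
        mojibake = f.read()        # "cafÃ©"
finally:
    os.remove(file_path)
```

Note that the wrong codec raises no error here; the text is simply decoded incorrectly, which is why passing the encoding explicitly is the safer habit.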

Martijn Pieters
  • 2
    This merely reads a UTF-8 encoded file. The OP asks for a way to ignore non-UTF-8 characters while reading a file that contains some dirty characters. This answer causes a `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 80189: invalid continuation byte` when a dirty character is read. – mareoraft Oct 12 '20 at 01:19
  • @mareoraft: The OP never properly specified what they meant by *non-UTF8 characters*, which they wanted to remove after decoding. For files with **some** invalid data, you could use a different error handler, such as `errors='ignore'` or `errors='replace'` or `errors='surrogateescape'`. See https://docs.python.org/3/library/codecs.html#error-handlers – Martijn Pieters Oct 12 '20 at 14:03
  • @mareoraft: do make 100% certain that the file is *meant* to be decodable as UTF-8. UTF-8 is resistant to some corruption: ignoring or replacing error values will not make subsequent correct data unreadable. – Martijn Pieters Oct 12 '20 at 14:05
2

If you use

file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()

then bytes that are not valid UTF-8 will be silently dropped instead of raising an error. Read the open() function documentation for details; it has a section on the possible values for the errors parameter.
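To compare how a few of the documented error handlers behave, here is a minimal sketch. The sample bytes, the temporary file, and the `read_with` helper are all hypothetical and exist only for this demonstration; the stray byte 0xC9 mirrors the one in the error message quoted in the comments.

```python
import os
import tempfile

# Hypothetical sample: ASCII text with one stray byte (0xC9) that is not
# followed by a valid UTF-8 continuation byte.
raw = b"abc\xc9def"
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(raw)
    file_path = tmp.name

def read_with(errors):
    """Read the sample file as UTF-8 using the given error handler."""
    with open(file_path, "r", encoding="utf-8", errors=errors) as f:
        return f.read()

try:
    # 'strict' is the default handler and raises on the invalid byte.
    try:
        read_with("strict")
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    ignored = read_with("ignore")            # 'abcdef'
    replaced = read_with("replace")          # 'abc\ufffddef'
    escaped = read_with("backslashreplace")  # 'abc\\xc9def'
finally:
    os.remove(file_path)
```

With `'ignore'` the bad byte disappears without a trace, so `'replace'` or `'backslashreplace'` may be preferable when you want to see where the invalid data was.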

mareoraft