read line with .encode with utf8

Question

I read lines from a file like:

The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)

Die virtuelle Katastrophe: So führen Sie Teams über Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)

I read / encode them with:

title = line.encode('utf8')

but the output is:

b'Die virtuelle Katastrophe: So f\xc3\xbchren Sie Teams \xc3\xbcber Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)'

b'The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)'

Why is the `b'` prefix always added? How do I read the file properly so that the umlauts are preserved?

Here is the complete relevant code snippet:

# Parse the clippings.txt file
lines = [line.strip() for line in codecs.open(config['CLIPPINGS_FILE'], 'r', 'utf-8-sig')]
for line in lines:
    line_count = line_count + 1
    if (line_count == 1 or is_title == 1):
        # ASSERT: this is a title line
        #title = line.encode('ascii', 'ignore')
        title = line.encode('utf8')
        prev_title = 1
        is_title = 0
        note_type_result = note_type = l = l_result = location = ""
        continue

thanks

`b''` means that you got a byte buffer, not a (Unicode) string, as is to be expected from `encode()`, which turns a string into an encoded byte sequence. In your case, you need to `decode()` *from* utf-8, not encode *to* utf-8. Or even better, use [`codecs.open(..., encoding='utf-8')`](https://docs.python.org/3/library/codecs.html#codecs.open). For a proper answer I'd like to see more of your code, though. — dhke, Jun 12 '16 at 11:07
@dhke It might be enough just to remove that `.encode` line, because the output looks like correct UTF-8, meaning `line` was already a valid Unicode string. — melpomene, Jun 12 '16 at 11:11
@melpomene That works, when the default encoding is *utf-8*, yes. — dhke, Jun 12 '16 at 11:16
@dhke I added the code. I already use codecs.open (was used in the base I used for changing this script to fit my needs). Using .decode or removing .encode results in an error — f0rd42, Jun 12 '16 at 11:18
@f0rd42 I see. Looking at the snippet, you should be able to simply drop the encode part altogether. At this point, `line` is already a (decoded) Python string. `'\xc3\xbc'` is also correct *utf-8* for German `ü`. What made you think the umlauts are not read correctly? Do they display incorrectly on output? — dhke, Jun 12 '16 at 11:20
@f0rd42 "results in an error" is very unhelpful. What's the error message? — melpomene, Jun 12 '16 at 11:20
@melpomene `AttributeError: 'str' object has no attribute 'decode'` ;-). It's Python 3 and the Python 3 string doesn't have a `decode()`, because it's already decoded. — dhke, Jun 12 '16 at 11:22
@dhke OK, I can see that happening from using `.decode`, but what's the error from removing `.encode`? — melpomene, Jun 12 '16 at 11:22
Just doing `title = line` does everything I need. I took the code as a basis for my own needs. Thanks to you both — f0rd42, Jun 12 '16 at 11:25

score 5 · Answer 1 · answered Jun 12 '16 at 11:23

The method str.encode turns a unicode string into a bytes object:

str.encode(encoding="utf-8", errors="strict")
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

So what you get is exactly what is expected.
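To make the direction of the conversion concrete, here is a minimal sketch of the round trip between `str` and `bytes`. The sample text is taken from the question; the variable names are only illustrative.

```python
# A str holds decoded text; encode() turns it into UTF-8 bytes.
title = "So führen Sie Teams über Distanz"

encoded = title.encode("utf8")    # str -> bytes, printed with a b'' prefix
decoded = encoded.decode("utf8")  # bytes -> str, the original text again

print(type(encoded).__name__)  # bytes
print(encoded)                 # each "ü" appears as the two bytes \xc3\xbc
print(type(decoded).__name__)  # str
print(decoded == title)        # the round trip preserves the text
```

The umlauts are never lost here: `\xc3\xbc` is simply how the `bytes` repr shows the two UTF-8 bytes of `ü`.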

On most machines, you can just open the file and read it; in Python 3, a file opened in text mode already yields decoded `str` objects. If the file encoding is not the system default, you can pass it as a keyword argument:

with open(filename, encoding='utf8') as f:
    line = f.readline()
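Applied to the question, a sketch could look like the following. It assumes the clippings file starts with a byte order mark, as the `'utf-8-sig'` argument in the question suggests. The helper name `read_lines` and the temporary sample file are hypothetical stand-ins, not part of the original script.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8-sig"):
    """Return the stripped lines of a text file as str objects."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f]


# Hypothetical sample file standing in for the clippings file.
sample = "Die virtuelle Katastrophe: So führen Sie Teams über Distanz\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

lines = read_lines(sample_path)
os.remove(sample_path)

# No encode() call is needed: the line is already a str with its umlauts intact.
title = lines[0]
```

Because the built-in `open` decodes while reading, `title` can be used directly, which matches the asker's finding that `title = line` is sufficient.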
