I read lines from a file like:

The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)

Die virtuelle Katastrophe: So führen Sie Teams über Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)

I read / encode them with:

title = line.encode('utf8')

but the output is:

b'Die virtuelle Katastrophe: So f\xc3\xbchren Sie Teams \xc3\xbcber Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)'

b'The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)'

Why is the "b'" always added? How do I properly read the files so that the "Umlauts" are preserved?

Here is the complete relevant code snippet:

# Parse the clippings.txt file
lines = [line.strip() for line in codecs.open(config['CLIPPINGS_FILE'], 'r', 'utf-8-sig')]
for line in lines:
    line_count = line_count + 1
    if (line_count == 1 or is_title == 1):
        # ASSERT: this is a title line
        #title = line.encode('ascii', 'ignore')
        title = line.encode('utf8')
        prev_title = 1
        is_title = 0
        note_type_result = note_type = l = l_result = location = ""
        continue

thanks

Rory Daulton
f0rd42
  • `b''` means that you got a byte buffer, not a (Unicode) string, which is to be expected from `encode()`: it turns a string into an encoded byte sequence. In your case, you need to `decode()` *from* utf-8, not encode *to* utf-8. Or even better, use [`codecs.open(..., encoding='utf-8')`](https://docs.python.org/3/library/codecs.html#codecs.open). For a proper answer I'd like to see more of your code, though. – dhke Jun 12 '16 at 11:07
  • @dhke It might be enough just to remove that `.encode` line, because the output looks like correct UTF-8, meaning `line` was already a valid Unicode string. – melpomene Jun 12 '16 at 11:11
  • @melpomene That works when the default encoding is *utf-8*, yes. – dhke Jun 12 '16 at 11:16
  • @dhke I added the code. I already use `codecs.open` (it was used in the script I took as a base and adapted to my needs). Using `.decode` or removing `.encode` results in an error. – f0rd42 Jun 12 '16 at 11:18
  • @f0rd42 I see. Looking at the snippet, you should be able to simply drop the encode part altogether. At this point, `line` is already a (decoded) Python string. `'\xc3\xbc'` is also correct *utf-8* for German `ü`. What made you think the umlauts are not read correctly? Do they display incorrectly on output? – dhke Jun 12 '16 at 11:20
  • @f0rd42 "results in an error" is very unhelpful. What's the error message? – melpomene Jun 12 '16 at 11:20
  • @melpomene `AttributeError: 'str' object has no attribute 'decode'` ;-). It's Python 3, and the Python 3 string doesn't have a `decode()` because it's already decoded. – dhke Jun 12 '16 at 11:22
  • @dhke OK, I can see that happening from using `.decode`, but what's the error from removing `.encode`? – melpomene Jun 12 '16 at 11:22
  • Just doing `title = line` does everything I need. I took the code as a basis for my needs. Thanks to you both. – f0rd42 Jun 12 '16 at 11:25

1 Answer

The method str.encode turns a Unicode string (str) into a bytes object:

str.encode(encoding="utf-8", errors="strict")
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

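So what you get is exactly what is expected: the b'...' prefix is how Python displays a bytes object, and \xc3\xbc is the UTF-8 encoding of ü. Since the lines you read are already decoded strings, you can drop the call to encode and use them directly.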
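
To make the difference between str and bytes concrete, here is a minimal sketch of the round trip. The variable names (`title`, `encoded`, `decoded`) are only illustrative and the sample text is shortened from the question:

```python
# A str holds Unicode text; encode() converts it to UTF-8 bytes.
title = "So führen Sie Teams über Distanz"
encoded = title.encode("utf-8")

# The type is bytes, and printing it shows the b'' prefix
# with each ü written as the two-byte escape \xc3\xbc.
print(type(encoded).__name__)
print(encoded)

# decode() reverses the step and returns the original str.
decoded = encoded.decode("utf-8")
print(decoded == title)
```

The umlauts are never lost here: they are only displayed as escapes because a bytes object is being printed instead of a string.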

On most machines, you can just open the file in text mode and read it; Python then decodes it with the system default encoding. If the file encoding is not the system default, you can pass it as a keyword argument:

with open(filename, encoding='utf8') as f:
    line = f.readline()
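
Applied to the question's situation, a sketch could look like the following. It assumes the clippings file is UTF-8 with a byte order mark (the question opens it with 'utf-8-sig'); the temporary file and the names `sample`, `path`, and `titles` are stand-ins for illustration, not part of the original script:

```python
import os
import tempfile

sample = ("Die virtuelle Katastrophe: So führen Sie Teams über Distanz "
          "zur Spitzenleistung (German Edition) (Thomas, Gary)\n")

# Create a small stand-in file; 'utf-8-sig' writes a BOM at the start.
with tempfile.NamedTemporaryFile("w", encoding="utf-8-sig",
                                 suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Text mode plus an explicit encoding yields str objects directly;
    # 'utf-8-sig' also strips the BOM, so no encode/decode call is needed.
    with open(path, encoding="utf-8-sig") as f:
        titles = [line.strip() for line in f]
finally:
    os.remove(path)

# Each title is a plain str that still contains the umlauts.
print(type(titles[0]).__name__, "ü" in titles[0])
```

Because the lines come back as str, they can be assigned to `title` as they are.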
MaxNoe