Text en/decoding issue

Question

I'm hoping someone can relieve me of my ignorance here: I'm currently using Python 3.6.4 and I'm trying to reduce strings to simple alphanumerics.

I've got the how mostly sorted until I get to characters with diacritics. It involves football team names so I'm looking to convert, by way of example, 1. FC Köln to 1fckoln. So:

import requests

c = requests.get(the_url)
content = c.text

#code here to extract team name into variable 'ht'

ht = simpname(ht)

def simpname(who):
    punct = "' .-/\°()"
    the_o = 'òóôõöÖøØ'

    for p in punct:
        if p in who:
            who = who.replace(p, '')

    if the_o in who:
        who = who.replace(the_o, 'o')

    who = who.lower()

    return who

(NB: the code is cut down for this example; I'm handling a, e, etc. in the same fashion.)

The only problem here is that, in my example, the text is arriving as 1. FC KÃ¶ln. I know I've got a character encoding issue, but I can't seem to get it to the right state. Can someone suggest a way around my issue?
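
For readers wondering how text ends up in this state, here is a minimal sketch. It assumes the server sends UTF-8 bytes that are then decoded as ISO-8859-1; that assumption matches the garbled output above but is not confirmed at this point in the question.

```python
# UTF-8 bytes for the real name, as a server would send them.
raw = '1. FC Köln'.encode('utf-8')

# Decoding those bytes with the wrong codec yields the garbled text.
garbled = raw.decode('iso-8859-1')

# Reversing the wrong step recovers the original, because ISO-8859-1
# maps every byte to exactly one character and so loses nothing.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(garbled)
print(repaired)
```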

Solved! Thank you to @Idlehands and the commenters below for their advice. Below is the same code with the updates applied, so that future readers can see the difference.

import requests

incoming = requests.get(the_url)
cinput = incoming.content             # raw bytes of the response body
cinput = cinput.decode('iso-8859-1')  # decode() already returns a str, so no str() call is needed

# more code, eventually extracts a team name under 'ht'

ht = simpname(ht)

...

def simpname(who):
    punct = "' .-/\\°()"  # backslash escaped so it stays a literal backslash
    the_o = 'òóôõöÖøØ'

    # who is currently 1. FC KÃ¶ln

    who = who.encode('latin-1') # who becomes b'1. FC K\xc3\xb6ln'
    who = who.decode('utf-8')   # who becomes '1. FC Köln'

    for p in punct:
        if p in who:
            who = who.replace(p, '')

    for an_o in the_o:
        if an_o in who:
            who = who.replace(an_o, 'o')

    who = who.lower()

    return who
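
The encode/decode round trip above works because ISO-8859-1 maps every byte to a character, so the wrong decode is reversible. A possibly simpler alternative, sketched here on the assumption that the page body really is UTF-8, is to decode the raw bytes once with the right codec. The helper name `decode_body` is hypothetical. With `requests`, the bytes would come from `incoming.content`, and setting `incoming.encoding = 'utf-8'` before reading `incoming.text` should have the same effect.

```python
def decode_body(raw, encoding='utf-8'):
    """Decode response bytes once; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors='replace')


# Stand-in for incoming.content: the UTF-8 bytes of the team name.
body = b'1. FC K\xc3\xb6ln'
team = decode_body(body)
print(team)
```

With the text decoded correctly up front, the `encode('latin-1')` and `decode('utf-8')` lines inside `simpname` would no longer be needed.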

when you open your file, open it as bytes mode `'rb'` with `encoding='utf-8'`. Give it a try and see if it helps. — r.ook, Feb 03 '18 at 03:23
Ah, interesting point to include, I can confirm that it's a str '' when it reaches the function, but I'm pulling it from an online source using the requests module. I'm using pycharm as my IDE (recommended), it shows who as str '`1. FC KÃ¶ln`' when I step through the function. — Tim Hamilton, Feb 03 '18 at 03:42
Are you using `.text` or `.content` on the response to retrieve the text? I would try decoding the bytes from `.content` with a different encoding. — r.ook, Feb 03 '18 at 03:55
Ah ha! OK, another upvote for PyCharm, just discovered the info it gives you on objects whilst looking at how *.content and *.text differ. I can now confirm that the *byte*-string is encoded as 'ISO-8859-1'. So I gather I just need to convert that to 'utf-8'? — Tim Hamilton, Feb 03 '18 at 04:03
@Idlehands there's no guessing, the RFC that controls encoding is very explicit about how websites should be decoded. I wish I could find the reference where I learned this - I believe the conclusion was that `requests` follows the RFC correctly. — Mark Ransom, Feb 03 '18 at 04:36
@MarkRansom thanks for correcting me, I've deleted my comment to avoid spreading misinformation. TimHamilton You might want to update your question with the changes for readability. — r.ook, Feb 03 '18 at 04:42
@TimHamilton you might also want to see this link: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string — r.ook, Feb 03 '18 at 04:45
I think I've got it! Thank you @Idlehands, MarkRansom and t.m.adam (who is correct about the the_o not being iterated). In a moment I will update the original post to show how I've changed the code. (NB: SO will only let me @ one person, hence missing Mark and TMA). — Tim Hamilton, Feb 03 '18 at 04:49
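
One general way to strip accents without listing every accented letter by hand is Unicode normalization with the standard `unicodedata` module. The sketch below is only an illustration and not the code from the question: the names `simplify_name`, `NO_DECOMPOSITION` and `ALLOWED` are hypothetical, and it assumes that dropping every character outside ASCII letters and digits is acceptable.

```python
import string
import unicodedata

# Letters such as ø have no Unicode decomposition, so they are mapped by hand.
# This table is illustrative, not exhaustive.
NO_DECOMPOSITION = str.maketrans({'ø': 'o', 'Ø': 'O'})
ALLOWED = set(string.ascii_letters + string.digits)


def simplify_name(name):
    """Reduce a name to lowercase ASCII letters and digits."""
    name = name.translate(NO_DECOMPOSITION)
    # NFKD splits 'ö' into 'o' plus a combining diaeresis.
    decomposed = unicodedata.normalize('NFKD', name)
    # Keeping only ASCII letters and digits drops the combining marks,
    # the punctuation and the spaces in one pass.
    kept = [ch for ch in decomposed if ch in ALLOWED]
    return ''.join(kept).lower()


print(simplify_name('1. FC Köln'))  # 1fckoln
```

The input must already be correctly decoded text: applied to the garbled form `1. FC KÃ¶ln`, this would produce a different, wrong result.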
