
I want to build an RSS feed reader by myself, so I started out.

My test page, from which I get my feed, is 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.

It is a German page, which is why I chose 'iso-8859-1' as the decoding. Here is the code:

import re
import time
import urllib2

# opener and tagFilter() are set up elsewhere in my script:
# opener comes from urllib2.build_opener(), tagFilter strips leftover HTML tags.
opener = urllib2.build_opener()

def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
    #print(sourceCode)
    try:
        titles = re.findall(r'<title>(.*?)</title>', sourceCode)
        links = re.findall(r'<link>(.*?)</link>', sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed + ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1", "replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter) + ".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
                counter += 1
                time.sleep(10)
  1. First of all, I open the website mentioned above; so far there seems to be no problem opening it.
  2. After decoding the page, I search it for every expression inside the <link> tags.
  3. I then select the links that contain "rss" and store them in a new list.
  4. With the new list, I open each link and search it for content.

And now the problems start. I decode those pages, which are also German, and I get errors like:

  • 'charmap' codec can't encode character '\x9f' in position 339: character maps to <undefined>
  • 'charmap' codec can't encode character '\x9c' in position 43: character maps to <undefined>
  • 'charmap' codec can't encode character '\x80' in position 131: character maps to <undefined>

I really have no idea why it won't work. The data collected before the error appears gets written into a text file.

Example of collected data:

Einloggen auf heise onlineTopthemen:Nachdem Google Anfang des Monats eine 64-Bit-Beta seines hauseigenen Browsers Chrome für Windows 7 und Windows 8 vorgestellt hatte, kümmert sich der Internetriese nun auch um OS X. Wie Tester melden, verbreitet Google über seine Canary-/Dev-Kanäle für Entwickler und Early Adopter nun automatisch 64-Bit-Builds, wenn der User über einen kompatiblen Rechner verfügt.

I hope someone can help me. Other clues or information that would help me build my own RSS feed reader are also welcome.

Greetings Templum

  • Try UTF-8 instead of iso-8859-1. – miko Aug 06 '14 at 13:05
  • You can't just make up an encoding. That page is encoded in utf-8, and the XML starts with a declaration that says just that. – Wooble Aug 06 '14 at 13:05
  • And please read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags before you even think of parsing XML with regular expressions. – Wooble Aug 06 '14 at 13:06
  • @Wooble and also miko, thanks for mentioning the UTF-8 thing. I just thought, since it is a German site using special characters like ü and ä, that iso-8859-1 was right. It was just a mistake... – Templum Aug 06 '14 at 13:24
  • What do you mean "except for the German parts"? UTF-8 can encode all of the unicode codepoints. I suggest you read http://bit.ly/unipain – Wooble Aug 06 '14 at 13:28
  • @Wooble never mind, I meant that a new error came up: character '\u0308' in position 139: character maps to <undefined>. But that's mainly because it parses far more than the article. – Templum Aug 06 '14 at 13:32

1 Answer


Per miko's and Wooble's comments:

iso-8859-1 should be utf-8 since the XML returned says the encoding is utf-8:

In [71]: sourceCode = opener.open(page).read()

In [72]: sourceCode[:100]
Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"
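The wrong decode is also what produces the 'charmap' errors from the question: decoding UTF-8 bytes as iso-8859-1 turns each multi-byte character into stray code points (often C1 controls like U+0082), which the cp1252 codec — the Windows default for files and the console — cannot encode on output. A small sketch with a made-up sample string:

```python
data = '10 €'.encode('utf-8')      # the bytes a UTF-8 server actually sends
text = data.decode('iso-8859-1')   # wrong codec: '€' becomes 'â\x82¬'
try:
    text.encode('cp1252')          # what a Windows-default file/console does
except UnicodeEncodeError as err:
    print(err)                     # 'charmap' codec can't encode character '\x82' ...
print(data.decode('utf-8'))        # decoding as UTF-8 round-trips cleanly: 10 €
```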

and you really ought to be using an XML parser like lxml or BeautifulSoup to parse the XML; relying on the re module alone is error-prone.
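For instance, here is a minimal sketch using the stdlib xml.etree.ElementTree module, run on a hypothetical trimmed-down RSS snippet (lxml and BeautifulSoup offer similar APIs):

```python
import xml.etree.ElementTree as ET

# A hypothetical, cut-down RSS document standing in for the real feed.
rss = """<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0"><channel>
  <title>heise online</title>
  <item>
    <title>Example article</title>
    <link>http://example.com/article.html</link>
  </item>
</channel></rss>"""

root = ET.fromstring(rss)
# The parser handles nesting, entities and CDATA that a regex would mangle.
items = [(item.findtext('title'), item.findtext('link'))
         for item in root.findall('.//item')]
print(items)  # [('Example article', 'http://example.com/article.html')]
```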


feedSource is a unicode object, since it is the result of a decode:

        feedSource = opener.open(feed).read().decode("utf-8","replace")

Therefore, line is also unicode:

    content = re.findall(r'<p>(.*?)</p>', feedSource)
    for line in content:
        ...

tempTxt is a plain file handle (as opposed to one opened with io.open, which takes an encoding parameter). So tempTxt expects bytes (i.e. a str), not unicode.

So encode the line before writing to the file:

        for line in content:
            tempTxt.write(line.encode('utf-8'))

or define tempTxt using io.open and specify an encoding:

import io
with io.open(filename, "w", encoding='utf-8') as tempTxt:
    for line in content:
        tempTxt.write(line)

By the way, it's not good to catch all Exceptions unless you are ready to handle all Exceptions:

    except Exception as e:
        print(str(e))   

and furthermore, if you only print the error message, Python will go on to execute the subsequent code even though variables assigned in the try suite may never have been defined. For example,

    try:
        print("Besuche " + feed+ ":")
        feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
    except Exception as e:
        print(str(e))   
    content = re.findall(r'<p>(.*?)</p>', feedSource)

using feedSource in the call to re.findall may raise a NameError if an exception was raised before feedSource was defined.

You might want to add a continue statement in the except-suite if you want Python to pass over this feed and move on to the next:

    except Exception as e:
        print(str(e))   
        continue
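To see the pattern in isolation, here is a small self-contained sketch (the fetch function is a made-up stand-in for opener.open(feed).read()):

```python
def process(feeds, fetch):
    """Fetch each feed; on failure, report it and move on to the next."""
    results = []
    for feed in feeds:
        try:
            source = fetch(feed)
        except Exception as e:
            print(str(e))
            continue  # skip this feed: `source` was never assigned for it
        results.append(source)
    return results

# Hypothetical fetcher: one URL fails, the others succeed.
def fetch(url):
    if 'bad' in url:
        raise IOError('cannot open ' + url)
    return '<p>ok</p>'

print(process(['http://good/a.htm', 'http://bad/b.htm', 'http://good/c.htm'], fetch))
```

The failing feed is reported and skipped, and the loop continues with the remaining feeds instead of crashing with a NameError.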
unutbu
  • Thanks for the tip about continue, I didn't know it before. And thanks so much for your effort. – Templum Aug 06 '14 at 13:28