
I want to build an RSS feed reader by myself, so I started out.

My test page, from which I get my feed, is 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'.

It is a German page, which is why I chose 'iso-8859-1' as the decoding. Here is the code:

import re
import time
import urllib2

# opener and tagFilter() are set up elsewhere in my script:
# opener comes from urllib2.build_opener(), tagFilter strips leftover HTML tags.
opener = urllib2.build_opener()

def main():
    counter = 0
    try:
        page = 'http://heise.de.feedsportal.com/c/35207/f/653902/index.rss'
        sourceCode = opener.open(page).read().decode('iso-8859-1')
    except Exception as e:
        print(str(e))
    #print(sourceCode)
    try:
        titles = re.findall(r'<title>(.*?)</title>', sourceCode)
        links = re.findall(r'<link>(.*?)</link>', sourceCode)
    except Exception as e:
        print(str(e))
    rssFeeds = []
    for link in links:
        if "rss." in link:
            rssFeeds.append(link)
    for feed in rssFeeds:
        if ('html' in feed) or ('htm' in feed):
            try:
                print("Besuche " + feed + ":")
                feedSource = opener.open(feed).read().decode("iso-8859-1", "replace")
            except Exception as e:
                print(str(e))
            content = re.findall(r'<p>(.*?)</p>', feedSource)
            try:
                tempTxt = open("feed" + str(counter) + ".txt", "w")
                for line in content:
                    tempTxt.write(tagFilter(line))
            except Exception as e:
                print(str(e))
            finally:
                tempTxt.close()
                counter += 1
                time.sleep(10)
  1. First of all, I open the website mentioned above; so far there seems to be no problem opening it.
  2. After decoding the page, I search it for every expression inside the <link> tags.
  3. I then select the links that contain "rss" and store them in a new list.
  4. With the new list, I open each link and search it for content.

And now the problems start. I decode those pages, which are also German, and I get errors like:

  • 'charmap' codec can't encode character '\x9f' in position 339: character maps to <undefined>
  • 'charmap' codec can't encode character '\x9c' in position 43: character maps to <undefined>
  • 'charmap' codec can't encode character '\x80' in position 131: character maps to <undefined>

I really have no idea why it won't work. The data collected before the error appears gets written into a text file.

Example of collected data:

Einloggen auf heise onlineTopthemen:Nachdem Google Anfang des Monats eine 64-Bit-Beta seines hauseigenen Browsers Chrome für Windows 7 und Windows 8 vorgestellt hatte, kümmert sich der Internetriese nun auch um OS X. Wie Tester melden, verbreitet Google über seine Canary-/Dev-Kanäle für Entwickler und Early Adopter nun automatisch 64-Bit-Builds, wenn der User über einen kompatiblen Rechner verfügt.

I hope someone can help me. Other clues or information that would help me build my own RSS feed reader are also welcome.

Greetings Templum

  • Try UTF-8 instead of iso-8859-1. – miko Aug 06 '14 at 13:05
  • You can't just make up an encoding. That page is encoded in utf-8, and the XML starts with a declaration that says just that. – Wooble Aug 06 '14 at 13:05
  • And please read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags before you even think of parsing XML with regular expressions. – Wooble Aug 06 '14 at 13:06
  • @Wooble and also miko, thanks for mentioning the UTF-8 thing. I just thought, since it is a German site using special characters like ü and ä, that iso-8859-1 was right. It was just a mistake... – Templum Aug 06 '14 at 13:24
  • What do you mean "except for the German parts"? UTF-8 can encode all of the unicode codepoints. I suggest you read http://bit.ly/unipain – Wooble Aug 06 '14 at 13:28
  • @Wooble never mind, I meant that a new error came up: character '\u0308' in position 139: character maps to <undefined>. But that's mainly because it parses far more than the article. – Templum Aug 06 '14 at 13:32

1 Answer


Per miko's and Wooble's comments:

iso-8859-1 should be utf-8 since the XML returned says the encoding is utf-8:

In [71]: sourceCode = opener.open(page).read()

In [72]: sourceCode[:100]
Out[72]: "<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet type='text/xsl' href='http://heise.de.feedspo"
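The wrong decode is also what produces the 'charmap' errors from the question: decoding UTF-8 bytes as iso-8859-1 turns each multi-byte character into stray code points (often C1 controls like U+0082), which the cp1252 codec — the Windows default for files and the console — cannot encode on output. A small sketch with a made-up sample string:

```python
data = '10 €'.encode('utf-8')      # the bytes a UTF-8 server actually sends
text = data.decode('iso-8859-1')   # wrong codec: '€' becomes 'â\x82¬'
try:
    text.encode('cp1252')          # what a Windows-default file/console does
except UnicodeEncodeError as err:
    print(err)                     # 'charmap' codec can't encode character '\x82' ...
print(data.decode('utf-8'))        # decoding as UTF-8 round-trips cleanly: 10 €
```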

and you really ought to be using an XML parser like lxml or BeautifulSoup to parse the XML; relying on the re module alone is error-prone.
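For instance, here is a minimal sketch using the stdlib xml.etree.ElementTree module, run on a hypothetical trimmed-down RSS snippet (lxml and BeautifulSoup offer similar APIs):

```python
import xml.etree.ElementTree as ET

# A hypothetical, cut-down RSS document standing in for the real feed.
rss = """<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0"><channel>
  <title>heise online</title>
  <item>
    <title>Example article</title>
    <link>http://example.com/article.html</link>
  </item>
</channel></rss>"""

root = ET.fromstring(rss)
# The parser handles nesting, entities and CDATA that a regex would mangle.
items = [(item.findtext('title'), item.findtext('link'))
         for item in root.findall('.//item')]
print(items)  # [('Example article', 'http://example.com/article.html')]
```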


feedSource is a unicode object, since it is the result of a decode:

        feedSource = opener.open(feed).read().decode("utf-8","replace")

Therefore, line is also unicode:

    content = re.findall(r'<p>(.*?)</p>', feedSource)
    for line in content:
        ...

tempTxt is a plain file handle (as opposed to one opened with io.open, which takes an encoding parameter). So tempTxt expects bytes (i.e. a str), not unicode.

So encode the line before writing to the file:

        for line in content:
            tempTxt.write(line.encode('utf-8'))

or define tempTxt using io.open and specify an encoding:

import io
with io.open(filename, "w", encoding='utf-8') as tempTxt:
    for line in content:
        tempTxt.write(line)

By the way, it's not good to catch all Exceptions unless you are ready to handle all Exceptions:

    except Exception as e:
        print(str(e))   

and furthermore, if you only print the error message, Python will go on to execute the subsequent code even though variables assigned in the try suite may never have been defined. For example,

    try:
        print("Besuche " + feed+ ":")
        feedSource = opener.open(feed).read().decode("iso-8859-1","replace")
    except Exception as e:
        print(str(e))   
    content = re.findall(r'<p>(.*?)</p>', feedSource)

using feedSource in the call to re.findall may raise a NameError if an exception was raised before feedSource was defined.

You might want to add a continue statement in the except-suite if you want Python to pass over this feed and move on to the next:

    except Exception as e:
        print(str(e))   
        continue
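To see the pattern in isolation, here is a small self-contained sketch (the fetch function is a made-up stand-in for opener.open(feed).read()):

```python
def process(feeds, fetch):
    """Fetch each feed; on failure, report it and move on to the next."""
    results = []
    for feed in feeds:
        try:
            source = fetch(feed)
        except Exception as e:
            print(str(e))
            continue  # skip this feed: `source` was never assigned for it
        results.append(source)
    return results

# Hypothetical fetcher: one URL fails, the others succeed.
def fetch(url):
    if 'bad' in url:
        raise IOError('cannot open ' + url)
    return '<p>ok</p>'

print(process(['http://good/a.htm', 'http://bad/b.htm', 'http://good/c.htm'], fetch))
```

The failing feed is reported and skipped, and the loop continues with the remaining feeds instead of crashing with a NameError.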
unutbu
  • Thanks for the tip about continue, I didn't know it before. And thanks so much for your effort. – Templum Aug 06 '14 at 13:28