Python RSS reader text filtering

Question

On my quest to learn Python 3.4 better, I decided to create a 'practical' program that simply reads the RSS feed of a link you give it. I was testing with the CNN RSS feed and got the description to print, but the description also contains a lot of "crap" I don't need. Is there a quick way to remove the unnecessary text? Basically, I want to keep "A deal to sell the Los Angeles Clippers for an NBA record price may move forward, a California probate judge ruled Monday." and remove everything else. Thanks.

Full RSS tag:

<description>A deal to sell the Los Angeles Clippers for an NBA record price may move forward, a California probate judge ruled Monday.&lt;div class="feedflare"&gt;
&lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:7Q72WNTAKBA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=FMi4oVkdS58:sssPw82MBtA:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=FMi4oVkdS58:sssPw82MBtA:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/FMi4oVkdS58" height="1" width="1"/&gt;</description>

How do you decide which parts of the text are necessary and unnecessary? A regex like `([^&]*)` would extract what you want from your current example, but may not work in all cases. — jonrsharpe, Jul 29 '14 at 14:44
@JoYSword That is what I want to do, but how do I accomplish that? The tag I provided is given by CNN, I am parsing it and printing it out in a terminal window. — Tanishq dubey, Jul 29 '14 at 14:45
@jonrsharpe The "useless" text always starts with the text '<' and continues through the end of the description tag. — Tanishq dubey, Jul 29 '14 at 14:48
Then that regex could do it - it takes all text between the opening tag and the first `'&'`. Or use a lookahead for more robust results: `(.*)(?=<)`. See [here](http://regex101.com/r/hT4sH2/1). — jonrsharpe, Jul 29 '14 at 14:49
@PadraicCunningham [Here is the RSS](http://rss.cnn.com/rss/cnn_topstories.rss), you might have to do a 'CTRL-U' to see the source though. — Tanishq dubey, Jul 29 '14 at 14:53
@PadraicCunningham It would be helpful if you could elaborate... — Tanishq dubey, Jul 29 '14 at 15:00
@jonrsharpe If there are some HTML tags inside the useful part, this will not succeed: `Joel shuts down <em>stackoverflow</em><[useless part]>` — JoYSword, Jul 29 '14 at 15:07

score 0 · Accepted Answer · edited May 23 '17 at 12:06

"Is there a quick way," you ask? Maybe.

First off, take a look at what you're really getting back by copying the entire bit of text you've given us and running it through this online HTML decoder:

http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/Decode.aspx

That should give you an idea of what you're dealing with. You need to decode the text so that it looks like proper HTML. You'll then see that, nested inside the description tag, you have a div tag and an img tag following the text you're interested in. If you believe that this is consistently what you get back from the feed, then it's safe to capture everything before the opening div tag and toss the rest away.

Take a look at this answer, regarding decoding HTML:

https://stackoverflow.com/a/2087433/155167
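
If you'd rather decode in your own program than paste into a website, Python 3.4 and later ship `html.unescape` in the standard library. A minimal sketch, assuming a shortened sample string (`raw` is made up here, not the full CNN description):

```python
import html

# Shortened stand-in for the escaped description text from the feed.
raw = ('A deal to sell the Los Angeles Clippers may move forward.'
       '&lt;div class="feedflare"&gt;&lt;/div&gt;')

# Turn entity references such as &lt; and &gt; back into < and >.
decoded = html.unescape(raw)
print(decoded)
```

After this step the trailing markup shows up as ordinary tags, which makes it much easier to see where the useful text ends.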

Once you've decoded the HTML, you can probably just use the find method of string objects.

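# Assume text is the decoded HTML, including the surrounding <description> tag.
start = len('<description>')
# Search for '<div' rather than '<div>': the real tag carries a class attribute.
end = text.find('<div')
# find() returns -1 when nothing matches; keep the rest of the text in that case.
if end == -1:
    end = len(text)
text = text[start:end]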
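
Putting the decode and the cut together might look like the sketch below. The function name `strip_feed_markup` is hypothetical, and it assumes you pass in only the contents of the description (no surrounding `<description>` tag) and that the unwanted part always begins with a div tag, as in the CNN example.

```python
import html


def strip_feed_markup(description):
    """Return the text that comes before the first <div in a description."""
    # Decode entity references so the markup appears as real tags.
    decoded = html.unescape(description)
    # Match '<div' so that attributes such as class="feedflare" don't matter.
    end = decoded.find('<div')
    if end == -1:
        # No div found: assume the whole description is useful text.
        return decoded.strip()
    return decoded[:end].strip()


sample = ('A California probate judge ruled Monday.'
          '&lt;div class="feedflare"&gt;&lt;/div&gt;'
          '&lt;img src="http://example.invalid/pixel" height="1" width="1"/&gt;')
print(strip_feed_markup(sample))
```

As pointed out in the comments, this breaks down if the useful text itself contains a div, so treat it as a quick fix for this particular feed rather than a general HTML cleaner.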
