
In my quest to learn Python 3.4 better, I decided to write a 'practical' program that simply reads the RSS feed at a link you give it. I was testing with the CNN RSS feed and got the description to print, but the description also contains a lot of "crap" I don't need. Is there a quick way to remove the unnecessary text? Basically, I want to keep "A deal to sell the Los Angeles Clippers for an NBA record price may move forward, a California probate judge ruled Monday." and remove everything else. Thanks.

Full RSS tag:

<description>A deal to sell the Los Angeles Clippers for an NBA record price may move forward, a California probate judge ruled Monday.&lt;div class="feedflare"&gt;
&lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:7Q72WNTAKBA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=7Q72WNTAKBA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=FMi4oVkdS58:sssPw82MBtA:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:qj6IDK7rITs"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?d=qj6IDK7rITs" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://rss.cnn.com/~ff/rss/cnn_topstories?a=FMi4oVkdS58:sssPw82MBtA:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/rss/cnn_topstories?i=FMi4oVkdS58:sssPw82MBtA:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/rss/cnn_topstories/~4/FMi4oVkdS58" height="1" width="1"/&gt;</description>
Tanishq dubey

1 Answer


"Is there a quick way," you ask? Maybe.

First off, take a look at what you're really getting back by copying the entire bit of text you've given us and running it through this online HTML decoder:

http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/Decode.aspx

That should give you an idea of what you're dealing with. You need to decode the text so that it looks like proper HTML. You'll then see that, nested inside the description tag, a div tag and an img tag follow the text you're interested in. If you believe that this is consistently what you get back from the feed, then it's safe to capture everything before the <div> and toss the rest away.

Take a look at this answer, regarding decoding HTML:

https://stackoverflow.com/a/2087433/155167
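Since you're on Python 3.4, the standard library's html.unescape should handle the decoding step. Here is a minimal sketch; the raw string is a shortened, made-up stand-in for the feed text, not the real CNN payload:

```python
import html

# Shortened, hypothetical stand-in for the escaped text inside <description>.
raw = 'A judge ruled Monday.&lt;div class="feedflare"&gt;&lt;/div&gt;'

# html.unescape (available since Python 3.4) turns &lt; and &gt; back into < and >.
decoded = html.unescape(raw)
print(decoded)
```

After this, the <div> shows up as an ordinary tag that you can search for.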

Once you've decoded the HTML, you can probably just use the find method of string objects.

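# Assume text is decoded HTML, so the <div ...> looks like a normal tag.
# Search for '<div', not '<div>', because the tag carries a class attribute.
start = len('<description>') if text.startswith('<description>') else 0
end = text.find('<div')
if end == -1:  # no <div> found, so keep everything after the start
    end = len(text)
text = text[start:end].strip()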
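If you would rather not decode by hand, an XML parser can do it for you. The sketch below is an alternative to the slicing above, not something the feed requires: it assumes you have the description element as a string, lets xml.etree.ElementTree decode the entities, and then cuts at the first '<div'. The function name, the marker parameter, and the sample string are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def summary_from_description(xml_fragment, marker='<div'):
    """Return the text that precedes the first marker in a <description> element."""
    element = ET.fromstring(xml_fragment)
    # The parser has already decoded &lt; and &gt;, so the markup is plain text here.
    text = element.text or ''
    cut = text.find(marker)
    if cut != -1:
        text = text[:cut]
    return text.strip()


# Shortened, hypothetical stand-in for the real feed item.
sample = ('<description>A judge ruled Monday.'
          '&lt;div class="feedflare"&gt;&lt;/div&gt;'
          '&lt;img src="http://example.com/pixel" height="1" width="1"/&gt;'
          '</description>')

summary = summary_from_description(sample)
print(summary)
```

This only works as long as the feed keeps putting the summary before the <div>. If that ever changes, a real HTML parser would be the safer choice.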
Community
Mario