I'm using Python 3.4 and Beautiful Soup 4 to extract some data from an RSS XML feed.
Everything mostly works, but for at least one item in the feed I don't get all of the data inside the <description>
tag.
For example, this is the item that is giving me problems:
<item>
<title>Google’s first DeepMind AI health project is missing something</title>
<link>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/</link>
<comments>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/#respond</comments>
<pubDate>Thu, 25 Feb 2016 11:36:56 +0000</pubDate>
<dc:creator><![CDATA[Kirsty Styles]]></dc:creator>
<category><![CDATA[Google]]></category>
<category><![CDATA[Insider]]></category>
<category><![CDATA[Deepmind]]></category>
<category><![CDATA[doctor]]></category>
<category><![CDATA[healthcare]]></category>
<category><![CDATA[NHS]]></category>
<category><![CDATA[UK]]></category>
<guid isPermaLink="false">http://thenextweb.com/?p=957096</guid>
<description><![CDATA[<img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" alt="Doctors Seek Higher Fees From Health Insurers" title="Google's first DeepMind AI health project is missing something" data-id="750745" /><br />Having been down at Google’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London… <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&utm_medium=feed&utm_campaign=profeed">This story continues</a> at The Next Web]]></description>
<wfw:commentRss>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/feed/</wfw:commentRss>
<slash:comments>0</slash:comments>
<enclosure url="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" type="image/jpeg" length="0" />
</item>
I'm using this code to parse the data:
from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('http://thenextweb.com/feed/')
xml = BeautifulSoup(req, 'xml')

for item in xml.find_all('item'):
    string = item.description.string
    # new_string = string.split('/>', 1)
    # print(new_string[0] + '/><p>')
    print(string)
The script runs without errors, but the output for that particular item is incomplete.
The commented-out lines in the code are meant to split off the <img> tag
and add a <p>
tag to structure the content.
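To make the intent of those commented-out lines concrete, here is a minimal sketch run on a hard-coded string rather than on the feed. The function name `split_image` and the sample value are placeholders made up for illustration; they are not part of the feed or of Beautiful Soup.

```python
def split_image(description):
    """Split a description at the end of its first self-closing tag.

    Returns (image_markup, remaining_text). image_markup is an empty
    string when the description contains no '/>' at all.
    """
    head, sep, tail = description.partition('/>')
    if not sep:
        return '', description
    return head + sep, tail


# Hypothetical sample, much shorter than the real <description> content.
sample = '<img src="a.jpg" alt="x" /><br />Some text'
image_markup, body = split_image(sample)

# Put the image first, then wrap the rest in a paragraph.
print(image_markup + '<p>' + body + '</p>')
```

This only works if the string handed to it still starts with the image markup, which is exactly what is not happening for the problematic item.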
The result that I get for that item is:
’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London… <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&utm_medium=feed&utm_campaign=profeed">This story continues</a> at The Next Web
I don't know what is happening.
If somebody can help me or point me to a way to extract the exact <img>
tag, I would be very thankful.
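For reference, the end goal could look roughly like the sketch below. It assumes the complete description text is already available as a plain string (which is the part that currently fails for this item) and parses that HTML fragment a second time with Python's built-in `html.parser`. The helper name `extract_img` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_img(description_html):
    """Return the first <img> tag found in an HTML fragment, or None."""
    fragment = BeautifulSoup(description_html, 'html.parser')
    return fragment.find('img')


# Hypothetical sample standing in for the CDATA content of <description>.
sample = ('<img width="520" height="245" src="http://example.com/pic.jpg" '
          'alt="Example" /><br />Some text')

img = extract_img(sample)
if img is not None:
    print(img['src'])  # attribute access on the parsed tag
    print(img)         # the tag serialized back to markup
```

Parsing the fragment instead of splitting on '/>' avoids depending on the image being a self-closing tag at the very start of the text, but it does not by itself explain why part of the description goes missing.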