
The following is the content of an item tag from an XML file. How can I extract the media:content tag using BeautifulSoup?

<item>
            <title>How Kerala is preparing for monsoon amid the COVID-19 pandemic</title>
            <link/>https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007
                  <description>Usually, Kerala begins its procedure for monsoon preparedness by January. This year, however, the officials got busy with preparing for a health crisis instead. “Kerala works six months and fights the monsoon in the other six months,” says Sekhar Kuriakose, member secretary of the Kerala State Disaster Management Authority (KSDMA). Usually, Kerala begins its monsoon preparedness by January, even before the India Meteorological Department (IMD) makes its first long-range forecast for southwe...</description>
            <pubdate>Thu, 21 May 2020 10:30:00 GMT</pubdate>
            <guid>https://www.thenewsminute.com/article/how-kerala-preparing-monsoon-amid-covid-19-pandemic-125007</guid>
            <media:content medium="image" url="https://www.thenewsminute.com/sites/default/files/Kerala-rain-trivandrum-1200.jpg" width="600"></media:content>
</item>
  • It won't get easier than this. What have you tried? There are a zillion tutorials online. – Jan May 21 '20 at 14:09
  • Does this answer your question? [Beautiful Soup and extracting a div and its contents by ID](https://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id) – Jan May 21 '20 at 14:11
  • Thanks for answering my question, but it doesn't help. The media tag here is a custom XML tag, and find('media') returns None. I also can't find it by id because it has no id. – Abhilash Kr May 21 '20 at 14:36

1 Answer

Your issue may be how BS4 handles namespaced tags with the parser backend you are using. Passing "lxml" instead of "xml" as the parser lets find() and find_all() match the prefixed name "media:content" as you might expect in this case.

Letting t be a string with the XML you provided,

from bs4 import BeautifulSoup

soup = BeautifulSoup(t, "xml")
print(soup.find_all("media:content"))

produces

[]

However, with the "lxml" parser (lxml's HTML parser), find_all() locates the element:

soup = BeautifulSoup(t, "lxml")
print(soup.find_all("media:content"))

produces

[<media:content medium="image" (...)></media:content>]
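
Once the tag is found, its attributes can be read like dictionary keys. The following is a minimal sketch, assuming beautifulsoup4 and lxml are installed; the string t is a trimmed copy of the item from the question, and the variable names media and image_url are only illustrative.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4 lxml

# Trimmed copy of the <item> from the question (assumed input).
t = """<item>
<title>How Kerala is preparing for monsoon amid the COVID-19 pandemic</title>
<media:content medium="image" url="https://www.thenewsminute.com/sites/default/files/Kerala-rain-trivandrum-1200.jpg" width="600"></media:content>
</item>"""

# "lxml" selects lxml's HTML parser, which keeps the prefixed tag name as-is.
soup = BeautifulSoup(t, "lxml")

# find() returns the first matching tag, or None if nothing matches.
media = soup.find("media:content")

# Tag attributes are accessed like dictionary keys.
image_url = media["url"] if media is not None else None
print(image_url)
```

Checking for None before indexing avoids a TypeError when an item has no media:content tag.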
Bernardo Sulzbach