Inclusive RSS parsing in Python?

Question

I'm parsing a set of RSS feeds dynamically. This is my code, which works for most sites.

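# Imports assumed from usage below: ET is taken to be the standard-library
# ElementTree and timezone to be Django's. Feed is the project's own Django
# model; its import path is not shown in the question.
import logging
import re
import xml.etree.ElementTree as ET

import maya
import requests
from django.utils import timezone


class ParseFeeds: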
    @staticmethod
    def parse(source):
        logger = logging.getLogger(__name__)
        logger.info("Starting {} at url: {}".format(source.name, source.link))
        root = ET.fromstring(requests.get(source.link).text)
        items = root.findall(".//item")
        logger.info(len(items))
        for item in items:
            title = ''
            if item.find('title') is not None:
                title = item.find('title').text
                title = ' '.join(title.split())
                title = re.sub("&#039;s", "'s", title)
            link = ''
            if item.find('link') is not None:
                link = item.find('link').text
            description = ''
            if item.find('description') is not None:
                description = item.find('description').text
                description = ' '.join(description.split())
                description = re.sub("&#039;s", "'s", description)
            published = timezone.now()
            if item.find('pubDate') is not None:
                logger.info(item.find('pubDate').text)
                published = maya.parse(item.find('pubDate').text).datetime()
            url = ''
            if item.find('enclosure') is not None:
                url = item.find('enclosure').attrib['url']
            if item.find('image') is not None:
                logger.info(item.find('image').text)
                url = item.find('image').text
            if not Feed.objects.filter(title=title).exists():
                logger.info(
                    "Adding feed with title:{} link:{} summary:{} published:{} url:{}".format(title, link, description,
                                                                                              published, url))
                feed = Feed(title=title, link=link, summary=description, published=published, url=url,
                            source=source)
                feed.save()
                logger.info("Adding {} from {}".format(feed.title, feed.source.name))

        logger.info("Finished {}".format(source.name))

However, it fails to extract the image URL from this source:

https://www.football.london/?service=rss

Calling item.find("media:thumbnail") doesn't work. How can I extract the thumbnail URL from the items in this feed?

Why try reinventing the wheel when you have [feedparser](https://pypi.org/project/feedparser/). — 0xInfection, Feb 11 '19 at 00:17
Oh feedparser sucks at every level. It doesn't solve the problem that RSS feeds don't have standard tags. — Melissa Stewart, Feb 11 '19 at 00:45
Follow the answer here to find element with prefix: https://stackoverflow.com/a/14853417/2998271 — har07, Feb 11 '19 at 02:56
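
A minimal sketch of the prefix lookup that the last comment points to, assuming the feed binds the media prefix to the Media RSS namespace (http://search.yahoo.com/mrss/) and that the thumbnail URL sits in a url attribute. The sample XML, the NAMESPACES mapping, and the thumbnail_url helper are made up for illustration and are not taken from the feed in the question.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed maps the "media" prefix to the Media RSS namespace.
NAMESPACES = {"media": "http://search.yahoo.com/mrss/"}

# Hypothetical sample feed, used only to keep the example self-contained.
SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>With thumbnail</title>
      <media:thumbnail url="https://example.com/thumb.jpg" width="100" height="100"/>
    </item>
    <item>
      <title>Without thumbnail</title>
    </item>
  </channel>
</rss>"""


def thumbnail_url(item):
    """Return the media:thumbnail url attribute of an item, or '' if absent."""
    # Passing the prefix map lets ElementTree resolve "media:" to its namespace URI.
    thumb = item.find("media:thumbnail", NAMESPACES)
    if thumb is None:
        return ""
    return thumb.get("url", "")


root = ET.fromstring(SAMPLE)
urls = [thumbnail_url(item) for item in root.findall(".//item")]
print(urls)
```

In the loop from the question, the same call could sit next to the enclosure and image checks, for example url = thumbnail_url(item) when neither of those is present.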
