
I am using the Beautiful Soup module of Python to get the feed URL of any website, but the code does not work for all sites. For example, it works for http://www.extremetech.com/ but not for http://cnn.com/. Actually http://cnn.com/ redirects to https://edition.cnn.com/. So I used the latter one, but with no luck. But I found by googling that the feed of CNN is here.

My code follows:

import urllib.parse
import requests
import feedparser
from bs4 import BeautifulSoup as bs4
# from bs4 import BeautifulSoup


def findfeed(site):
    user_agent = {
        'User-agent':
            'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
    raw = requests.get(site, headers = user_agent).text
    result = []
    possible_feeds = []
    #html = bs4(raw,"html5lib")
    html = bs4(raw,"html.parser")
    feed_urls = html.findAll("link", rel="alternate")
    for f in feed_urls:
        t = f.get("type",None)
        if t:
            if "rss" in t or "xml" in t:
                href = f.get("href",None)
                if href:
                    possible_feeds.append(href)
    parsed_url = urllib.parse.urlparse(site)
    base = parsed_url.scheme+"://"+parsed_url.hostname
    atags = html.findAll("a")
    for a in atags:
        href = a.get("href",None)
        if href:
            if "xml" in href or "rss" in href or "feed" in href:
                possible_feeds.append(base+href)
    for url in list(set(possible_feeds)):
        f = feedparser.parse(url)
        if len(f.entries) > 0:
            if url not in result:
                result.append(url)

    for result_indiv in result:
        print(result_indiv, end='\n  ')
    #return(result)




# findfeed("http://www.extremetech.com/")
# findfeed("http://www.cnn.com/")
findfeed("https://edition.cnn.com/")

How can I make the code work for all sites, for example https://edition.cnn.com/? I am using Python 3.

EDIT 1: If I need to use any module other than Beautiful Soup, I am ready to do that

Istiaque Ahmed
  • `requests` follows redirects; requesting `www.cnn.com` redirects you to `edition.cnn.com`, and the response you get is for the final URL in the redirect chain. Do a GET for `www` and look at the response's `.history` and `.url` attributes, and you'll see you end up on `edition`. – Martijn Pieters Feb 16 '18 at 12:56
  • RSS and Atom feeds *can* be autodiscoverable via `link` tags, but this is by no means a given. The CNN homepage has no such links, so you are out of luck there. – Martijn Pieters Feb 16 '18 at 12:56
  • A quick google leads to http://edition.cnn.com/services/rss/, but I don't know where CNN links to that page from. Plugging `edition.cnn.com` into an established RSS reader with discovery support (inoreader.com) shows that it too can't find feeds. – Martijn Pieters Feb 16 '18 at 12:59
  • Istiaque, you should read some documentation on how to make a web API. If you do so, you will realize that while some sites may follow a common convention, nothing requires them to do so, so you cannot assume that a site even has RSS feeds, much less links to them from their homepage. – Russia Must Remove Putin Feb 16 '18 at 13:05

2 Answers


How can I make the code work for all sites

You can't. Not every site follows best practices.

It is recommended that the site homepage include a `<link rel="alternate" type="application/rss+xml" ...>` or `<link rel="alternate" type="application/atom+xml" ...>` element, but CNN doesn't follow the recommendation. There is no way around this.
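
For illustration, here is a minimal sketch (on hypothetical markup, with made-up feed paths) of what the autodiscovery convention looks like when a site does follow it:

from bs4 import BeautifulSoup

# Hypothetical homepage markup that follows the autodiscovery recommendation.
html = """
<head>
  <link rel="alternate" type="application/rss+xml"
        title="Example RSS" href="/feeds/rss.xml">
  <link rel="alternate" type="application/atom+xml"
        title="Example Atom" href="/feeds/atom.xml">
</head>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("link", rel="alternate"):
    # Only the RSS/Atom MIME types count as feed autodiscovery links.
    if link.get("type") in ("application/rss+xml", "application/atom+xml"):
        print(link.get("type"), link.get("href"))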

But I found by googling that the feed of CNN is here.

That is not the homepage, and CNN has not provided any means to discover it from the homepage. There is no automated method for finding feeds on sites that have made this mistake.

Actually http://cnn.com/ redirects to https://edition.cnn.com/

Requests handles redirection for you automatically:

>>> response = requests.get('http://cnn.com')
>>> response.url
'https://edition.cnn.com/'
>>> response.history
[<Response [301]>, <Response [301]>, <Response [302]>]
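
If you want your script to resolve candidate feed hrefs against the page you actually ended up on, one option (a sketch, not a fix for the discovery problem itself) is to take `response.url` as the base and use `urllib.parse.urljoin` rather than concatenating base + href:

>>> import urllib.parse
>>> base = response.url                     # final URL after redirects
>>> urllib.parse.urljoin(base, '/services/rss/')
'https://edition.cnn.com/services/rss/'
>>> urllib.parse.urljoin(base, 'https://example.com/feed.xml')
'https://example.com/feed.xml'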

If I need to use any module other than BeautifulSoup, I am ready to do that

This is not a problem a module can solve. Some sites don't implement autodiscovery or do not implement it correctly.

For example, established RSS reader software that implements autodiscovery (like the online https://inoreader.com) can't find the CNN feeds either, unless you use the specific /services/rss URL you found by googling.

Martijn Pieters

Looking at this answer, this should work perfectly:

feeds = html.findAll(type='application/rss+xml') + html.findAll(type='application/atom+xml')

Trying that on the CNN RSS service page works perfectly. Your main problem is that edition.cnn.com does not have any trace of RSS in any way, shape or form.
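
For context, here is a short sketch of how that line could be plugged into the question's fetch-and-parse flow, assuming the /services/rss/ URL mentioned in the comments:

import requests
from bs4 import BeautifulSoup

# Fetch a page and collect any RSS/Atom <link> elements it declares.
# Whether a given page actually exposes such elements is not guaranteed.
raw = requests.get("http://edition.cnn.com/services/rss/").text
html = BeautifulSoup(raw, "html.parser")

feeds = html.findAll(type='application/rss+xml') + html.findAll(type='application/atom+xml')
for feed in feeds:
    print(feed.get("href"))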

The Pjot
  • What other types besides rss and atom do I need to look for? – Istiaque Ahmed Feb 16 '18 at 11:45
  • That I don't know. Just that these two are the most widely used, to my knowledge :) – The Pjot Feb 16 '18 at 12:00
  • My Python shell kept waiting and waiting with http://edition.cnn.com/services/rss/ as input. Won't https://edition.cnn.com/ or http://cnn.com/ work? – Istiaque Ahmed Feb 16 '18 at 12:06
  • If edition.cnn.com/services/rss has an RSS link, then https://edition.cnn.com/ also has one, right? – Istiaque Ahmed Feb 16 '18 at 12:18
  • No, it would have been more user-friendly if it did. But just because they are on the same domain does not in any way mean that it will contain RSS feeds. – The Pjot Feb 16 '18 at 12:27
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165288/discussion-between-istiaque-ahmed-and-the-pjot). – Istiaque Ahmed Feb 16 '18 at 12:28
  • You could use a CSS selector to get one list in one step: `soup.select('link[type$=+xml]')` (bit of a shortcut) or `soup.select('link[type=application/rss+xml],link[type=application/atom+xml]')` (more explicit). – Martijn Pieters Feb 16 '18 at 13:06
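
A small sketch of that selector approach on hypothetical markup (the attribute values are quoted here so the selectors stay valid CSS):

from bs4 import BeautifulSoup

html = """
<head>
  <link rel="alternate" type="application/rss+xml" href="/feeds/rss.xml">
  <link rel="alternate" type="application/atom+xml" href="/feeds/atom.xml">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# Shortcut: any <link> whose type attribute ends in "+xml".
print(soup.select('link[type$="+xml"]'))

# More explicit: match the two feed MIME types directly.
print(soup.select('link[type="application/rss+xml"], link[type="application/atom+xml"]'))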