I am using Python's Beautiful Soup module to find the feed URL of a website, but the code does not work for all sites. For example, it works for http://www.extremetech.com/ but not for http://cnn.com/. http://cnn.com/ actually redirects to https://edition.cnn.com/, so I tried the latter, but with no luck. By googling, I found that the feed of CNN is here.
My code follows:
import urllib.parse
import requests
import feedparser
from bs4 import BeautifulSoup as bs4
# from bs4 import BeautifulSoup

def findfeed(site):
    user_agent = {
        'User-agent':
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
    raw = requests.get(site, headers = user_agent).text
    result = []
    possible_feeds = []
    #html = bs4(raw,"html5lib")
    html = bs4(raw,"html.parser")
    # Feeds advertised in <link rel="alternate"> tags
    feed_urls = html.findAll("link", rel="alternate")
    for f in feed_urls:
        t = f.get("type",None)
        if t:
            if "rss" in t or "xml" in t:
                href = f.get("href",None)
                if href:
                    possible_feeds.append(href)
    parsed_url = urllib.parse.urlparse(site)
    base = parsed_url.scheme+"://"+parsed_url.hostname
    # Anchor tags whose href looks like a feed
    atags = html.findAll("a")
    for a in atags:
        href = a.get("href",None)
        if href:
            if "xml" in href or "rss" in href or "feed" in href:
                possible_feeds.append(base+href)
    # Keep only the candidates that feedparser can read entries from
    for url in list(set(possible_feeds)):
        f = feedparser.parse(url)
        if len(f.entries) > 0:
            if url not in result:
                result.append(url)
    for result_indiv in result:
        print( result_indiv,end='\n ')
    #return(result)

# findfeed("http://www.extremetech.com/")
# findfeed("http://www.cnn.com/")
findfeed("https://edition.cnn.com/")
How can I make the code work for all sites, for example https://edition.cnn.com/? I am using Python 3.
EDIT 1: If I need to use a module other than Beautiful Soup, I am ready to do that.
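
One thing I suspect after re-reading the code: `base+href` assumes every anchor href is relative to the site root, so an absolute link would end up with the host prepended to a full URL. Below is a minimal sketch of how that step might be handled instead, assuming `urllib.parse.urljoin` from the standard library is the right tool. The helper name `resolve_candidates` and the `feeds.example.com` address are hypothetical, made up only for illustration, and this may well not be the whole reason CNN fails.

```python
from urllib.parse import urljoin


def resolve_candidates(page_url, hrefs):
    """Resolve raw href values against the page URL, dropping duplicates.

    Hypothetical helper: it only normalizes candidate links and does not
    check whether they are real feeds.
    """
    resolved = []
    for href in hrefs:
        # urljoin keeps absolute URLs as they are and resolves
        # relative ones against page_url.
        absolute = urljoin(page_url, href)
        if absolute not in resolved:
            resolved.append(absolute)
    return resolved


if __name__ == "__main__":
    # Example hrefs are invented for illustration only.
    hrefs = ["/feed/", "http://feeds.example.com/rss.xml", "feed/"]
    for url in resolve_candidates("https://edition.cnn.com/", hrefs):
        print(url)
```

The resolved list could then be passed to the `feedparser.parse` loop in place of `possible_feeds`.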