I can use the Python Beautiful Soup module to extract news items from a site's feed URL. But suppose the site has no feed and I need to extract news articles from it on a daily basis, as if it had a feed.
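For context, this is roughly how I handle sites that do have a feed (the feed URL below is just a placeholder); each RSS item element carries the article title and link:

from urllib.request import urlopen
from bs4 import BeautifulSoup

FEED_URL = 'https://example.com/rss.xml'  # placeholder; any standard RSS feed

with urlopen(FEED_URL) as response:
    soup = BeautifulSoup(response.read(), 'xml')  # the 'xml' parser requires lxml

for item in soup.find_all('item'):
    # every RSS <item> contains a <title> and a <link> pointing to the article
    print(item.title.get_text(), item.link.get_text())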
The site https://www.jugantor.com/ has no feed, and even by googling I did not find one. With the following code snippet, I tried to extract the links from the site. The result shows links such as 'http://epaper.jugantor.com', but the news items appearing on the site are not included in the extracted links.
My Code:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

def getLinks(url):
    USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'
    request = Request(url)
    request.add_header('User-Agent', USER_AGENT)
    response = urlopen(request)
    content = response.read().decode('utf-8')
    response.close()

    soup = BeautifulSoup(content, "html.parser")
    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

print(getLinks("https://www.jugantor.com/"))
Obviously this does not serve the intended purpose. I need all the news article links of 'https://www.jugantor.com/' on a daily basis, as if I were acquiring them from a feed. I can use a cron job to run a script daily, but the challenge remains in identifying all the articles published on a particular day and then extracting them.
How can I do that? Is there a Python module or an algorithm for this?
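To make the goal concrete, this is the kind of daily script I have in mind (the file name is just a placeholder): it fetches the front page, resolves every href to an absolute URL, and keeps only the links not seen on previous runs.

import json
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

SITE_URL = 'https://www.jugantor.com/'
SEEN_FILE = Path('seen_links.json')  # placeholder file holding links from earlier runs
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

def fetch_links(url):
    request = Request(url, headers={'User-Agent': USER_AGENT})
    with urlopen(request) as response:
        soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')
    # collect every href, resolving relative URLs against the page URL
    return {urljoin(url, a['href']) for a in soup.find_all('a', href=True)}

def main():
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    current = fetch_links(SITE_URL)
    new_links = current - seen  # links that appeared since the last run
    for link in sorted(new_links):
        print(link)  # here each new article would be fetched and stored
    SEEN_FILE.write_text(json.dumps(sorted(seen | current)))

if __name__ == '__main__':
    main()

The piece I am missing is exactly the filtering step: how to distinguish actual news-article links from everything else the page links to (menus, the e-paper site, category pages, and so on).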
N.B: A somewhat similar question exists here, but it does not mention a feed as the parsing source. It seems the OP there is concerned with extracting articles from a page that lists them as a textual snapshot. Unlike that question, mine focuses on sites that do not have any feed, and the only existing answer there does not address this issue.