
I can use the Python Beautiful Soup module to extract news items from a site's feed URL. But suppose a site has no feed and I need to extract news articles from it on a daily basis, as if it had one.

The site https://www.jugantor.com/ has no feed, and even by googling I did not find one. With the following code snippet, I tried to extract the links from the site. The result shows links such as 'http://epaper.jugantor.com', but the news items appearing on the site are not included in the extracted links.

My Code:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re


def getLinks(url):

    # Send the request with a browser-like User-Agent so the server
    # does not reject it as coming from a bot.
    USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'
    request = Request(url)
    request.add_header('User-Agent', USER_AGENT)
    response = urlopen(request)
    content = response.read().decode('utf-8')
    response.close()

    soup = BeautifulSoup(content, "html.parser")
    links = []

    # Collect every anchor tag whose href starts with http://
    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))

    return links

print(getLinks("https://www.jugantor.com/"))

Obviously this does not serve the intended purpose. I need all the news article links of 'https://www.jugantor.com/' on a daily basis, as if I were acquiring them from a feed. I can use a cron job to run a script daily, but the challenge remains in identifying all articles published on a particular day and then extracting them.

How can I do that? Is there a Python module or algorithm for this?

N.B.: A somewhat similar question exists here, but it does not mention a feed as the parsing source. It seems the OP there is concerned with extracting articles from a page that lists them as a textual snapshot. Unlike that question, mine focuses on sites that do not have any feed. However, the only existing answer there does not address this issue either.

Istiaque Ahmed

1 Answer


I'm not sure I understand correctly, but the first thing I saw is {'href': re.compile("^http://")}.

You will miss all https and relative links. Relative links could probably be skipped here without any problems (I guess..), but clearly not the https ones. So, first thing:

{'href': re.compile("^https?://")}
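
In the question's loop, that change would look something like this (find_all is the current Beautiful Soup 4 name for the older findAll alias; both work):

# Match both http:// and https:// links so secure URLs are not dropped.
for link in soup.find_all('a', attrs={'href': re.compile("^https?://")}):
    links.append(link.get('href'))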

Then, to avoid downloading and parsing the same URL each day, you could extract the id of the article (in https://www.jugantor.com/lifestyle/19519/%E0%...A7%87 the id is 19519), save it in a database, and check whether the id already exists before scraping the page.
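
A minimal sketch of that id bookkeeping, assuming the id is always the numeric path segment after the section name (the regex, database file name, and helper names here are made up for illustration):

import re
import sqlite3

# Hypothetical helper: pull the numeric article id out of a URL such as
# https://www.jugantor.com/lifestyle/19519/...  ->  '19519'
ID_PATTERN = re.compile(r"jugantor\.com/[^/]+/(\d+)")

def article_id(url):
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

# Persist seen ids in SQLite so a daily cron run can skip known articles.
conn = sqlite3.connect("seen_articles.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)")

def is_new(aid):
    return conn.execute("SELECT 1 FROM seen WHERE id = ?", (aid,)).fetchone() is None

def mark_seen(aid):
    conn.execute("INSERT OR IGNORE INTO seen (id) VALUES (?)", (aid,))
    conn.commit()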

Last thing: I'm not sure this will be useful, but the URL https://www.jugantor.com/todays-paper/ makes me think you should be able to find only today's news there.
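
For example, reusing the question's getLinks function (with the corrected regex) on that page would presumably limit the results to today's items:

# Hypothetical usage: scrape the "today's paper" page instead of the
# front page, so the extracted links are limited to today's articles.
todays_links = getLinks("https://www.jugantor.com/todays-paper/")
print(len(todays_links), "links found for today")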

Arount
  • `{'href': re.compile("^https?://")}` - really worked as expected. It returns 266 links in total for the URL https://www.jugantor.com/. 1) Are these all of the links (without missing a single one) from the whole site? 2) The second time, when I am extracting new news, do I have to extract all news items again and then work out which ones are new? – Istiaque Ahmed Feb 19 '18 at 18:50
  • 1) The URLs you got are all the links starting with `http[s]` that appear on the page https://www.jugantor.com/ - but only those links. 2) Yes. You have to return only new pages. To do that, right before `links.append(link.get('href'))`, check in the database whether the id of the page already exists, and after the `append` line, add the id to the database so you won't crawl it again. 3) Game changer: https://www.jugantor.com/archive/online-edition/2018/02/19 - enjoy (see the sketch after these comments). – Arount Feb 19 '18 at 19:44
  • 1) 'you will have in the whole page' - I think you meant the whole site rather than the whole page. 2) I would have to run a cron job every few seconds to look for a single new news item appearing on the site. According to your suggestion, I would have to scan the whole site (maybe thousands of articles) to find a new item every few seconds. Is that process efficient enough? And I am talking about any site that has no feed, not just https://www.jugantor.com/. 3) Yes, it is very useful. But their site design may change without our automated system knowing it. – Istiaque Ahmed Feb 19 '18 at 19:59
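
Putting the thread together, a daily cron script might look like the sketch below. It assumes the dated archive pattern https://www.jugantor.com/archive/online-edition/YYYY/MM/DD from the comments still holds, and it reuses the question's getLinks function plus the hypothetical article_id / is_new / mark_seen helpers sketched in the answer:

from datetime import date

def crawl_today():
    # Build the dated archive URL (assumed pattern from the comments).
    archive_url = "https://www.jugantor.com/archive/online-edition/{:%Y/%m/%d}".format(date.today())

    new_links = []
    for url in getLinks(archive_url):
        aid = article_id(url)
        # Keep only article links whose id has not been seen before.
        if aid is not None and is_new(aid):
            new_links.append(url)
            mark_seen(aid)
    return new_links

if __name__ == "__main__":
    for url in crawl_today():
        print(url)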