
I am trying to build an application that needs a daily news feed from several websites. One way to do this is with Python's BeautifulSoup library. However, that only works well for sites that show their news on a single static page.

Consider a site like http://www.techcrunch.com. The front page shows only the headlines, and for the full story you need to click "Read more". Several other news websites work the same way. How do I extract such information and dump it into a file (.txt, .dmp, or any other kind of file)? What tool should I use? What approach should I take to implement this in Python?

I need this script to automatically download news from several websites once a day and store it in a file with fields such as heading, date, and content. I will be running this script on an Apache2 server. Any suggestions?

HackCode
  • 1
    Use BeautifulSoup to find the "Read more" links, then open each linked page and parse the content from it. You will have to load a new page for each article, which is slow, so I'd recommend running the requests in parallel. To schedule the script, use a cron job. And yes, you can write it in Python. Based on what you said, I can't see how you would benefit from machine learning or AI. – Rafael Barros Mar 19 '15 at 14:45
  • 1
    Machine learning and AI are where web scraping of this sort is mostly used, which is why I was hoping to get some kind of alternative from that field. And yes, what you said is true. However, it would not be an automated system: I would have to write a parser for every specific site, depending on its format. What I want is any kind of news. Maybe someone can suggest a different approach, perhaps something other than scraping web pages of this sort. Any suggestions? – HackCode Mar 19 '15 at 14:48
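The parallel fetching suggested in the first comment could be sketched roughly as follows. This is only an illustration using the standard library; the names `fetch`, `fetch_all`, and the `fetcher` parameter are hypothetical and not taken from the thread, and a real scraper would also want rate limiting and respect for each site's terms.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch every URL concurrently.

    Returns a dict mapping each URL to its text, or to None if the
    download failed. `fetcher` is injectable so the function can be
    tried out without network access.
    """
    urls = list(urls)  # iterated twice below, so materialise it first

    def safe(url):
        try:
            return fetcher(url)
        except Exception:
            # One dead link should not abort the whole daily run.
            return None

    # Threads suit this job because the work is I/O-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(safe, urls)))
```

Calling `fetch_all(article_urls)` with the list of "Read more" URLs would then download the article pages side by side instead of one after another.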

1 Answer


How do I extract such information and dump it in a file- txt/.dmp or any other kind of file? What tool should I use?

for more news you need to click on "Read more".

Tools you might use are Selenium, which is pure browser automation, or iMacros.

  1. Here is an example of using Selenium in Python on the server side.
  2. Here is a post (and video) on data extraction using iMacros. Since you need it only once a day, you could schedule it to run regularly on Windows or Mac.
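For sites where the "Read more" link is an ordinary anchor in the static HTML, a browser may not be needed at all. The following is a minimal sketch under that assumption: it swaps in Python's standard `html.parser` in place of Selenium or iMacros, collects the "Read more" URLs, and appends each article as one JSON line. The names `ReadMoreParser`, `find_read_more_links`, and `append_record` are hypothetical, and pages that build their links with JavaScript would still need browser automation.

```python
import json
from datetime import date
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReadMoreParser(HTMLParser):
    """Collect the href of every <a> whose text contains 'read more'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "read more" in "".join(self._text).lower():
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def find_read_more_links(html, base_url):
    """Return absolute URLs of all 'Read more' links in the given HTML."""
    parser = ReadMoreParser(base_url)
    parser.feed(html)
    return parser.links


def append_record(path, heading, content, url, day=None):
    """Append one article to a JSON-lines file and return the record."""
    record = {
        "heading": heading,
        "date": (day or date.today()).isoformat(),
        "content": content,
        "url": url,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

On a Linux server the once-a-day run could then be scheduled with a crontab entry along the lines of `0 6 * * * /usr/bin/python3 /path/to/fetch_news.py`, where both the time and the script path are placeholders.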
Igor Savinkin