I am new to web crawling and I am using Python 3.x. Currently I am practicing on Google News for a fresh start, but I have encountered a problem with my code: it runs but does not return anything. I want the code to crawl Google News for a query and return the URL, title, and brief that appear in each result.

Many thanks for your time. My code is below:

import time
from random import randint

import requests
from bs4 import BeautifulSoup

s = "Stack Overflow"
url = "http://www.google.com.sg/search?q="+s+"&tbm=nws&tbs=qdr:y"
time.sleep(randint(0, 2))  # small random pause before the request
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, 'lxml')
#print (len(soup.findAll("table", {"class": "result"})))
for result_table in soup.findAll("table", {"class": "result"}):
    a_click = result_table.find("a")
    print ("-----Title----\n" + a_click.renderContents())#Title
    print ("----URL----\n" + str(a_click.get("href")))#URL
    print ("----Brief----\n" + result_table.find("div", {"class": "c-abstract"}).renderContents())#Brief
    print ("Done")
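
A side note on the URL built above: the query string contains a literal space, which should be percent-encoded before it goes into the URL. A minimal sketch using the standard library's urllib.parse (variable names mirror the question's code):

```python
from urllib.parse import quote_plus

s = "Stack Overflow"
# quote_plus turns spaces into '+' and escapes other unsafe characters
url = "http://www.google.com.sg/search?q=" + quote_plus(s) + "&tbm=nws&tbs=qdr:y"
print(url)
# http://www.google.com.sg/search?q=Stack+Overflow&tbm=nws&tbs=qdr:y
```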
  • Instead of a link to your code, can you please edit this and paste it into the question directly? Then tell us what you think might be wrong and where you got stuck, then ask a question as reflected in the title of your post. – SDsolar May 03 '17 at 02:33
  • Hi, thanks for the reminder; I have provided the code. I am stuck: my code is not printing the URL, title, or brief of the results at all. – Sun May 03 '17 at 02:52

1 Answer

This is how I got results; hope it helps:

>>> for result_table in soup.findAll("div", {"class": "g"}):
...     a_click = result_table.find("a")
...     print ("-----Title----\n" + str(a_click.renderContents()))#Title
...     print ("----URL----\n" + str(a_click.get("href")))#URL
...     print ("----Brief----\n" + str(result_table.find("div", {"class": "st"}).renderContents()))#Brief
...     print ("Done")
... 
-----Title----
b"<b>Stack Overflow</b>: Like sleep? Don't code in C"
----URL----
/url?q=http://www.infoworld.com/article/3190701/application-development/stack-overflow-like-sleep-dont-code-in-c.html&sa=U&ved=0ahUKEwjc34W_3NLTAhVIMY8KHVu_BoUQqQIIFigAMAA&usg=AFQjCNE7xDqkg-kyFR65krfMIJqIchHFwg
----Brief----
b'In analysis of programming traffic on the <b>Stack Overflow</b> online community over for four weeks last August, <b>Stack Overflow</b> Insights data scientist David Robinson,\xc2\xa0...'
Done
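
Note that the href values in the output above are Google redirect links of the form /url?q=...; the real destination travels in the q query parameter. A small sketch (assuming the link shape shown above) that extracts it with the standard library's urllib.parse:

```python
from urllib.parse import urlparse, parse_qs

# A redirect-style href as returned in the results above (extra tracking parameters trimmed)
href = "/url?q=http://www.infoworld.com/article/3190701/application-development/stack-overflow-like-sleep-dont-code-in-c.html&sa=U"

# The target URL is carried in the 'q' parameter of the query string
real_url = parse_qs(urlparse(href).query)["q"][0]
print(real_url)
# http://www.infoworld.com/article/3190701/application-development/stack-overflow-like-sleep-dont-code-in-c.html
```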
  • Thanks man! Now suppose I want an exact-match result for "Stack Overflow"; how should I change the code? Setting s="Stack+Overflow" doesn't seem to work. – Sun May 03 '17 at 03:32
  • Solved: just change s="Stack+Overflow" to s='"Stack+Overflow"' and it works! – Sun May 03 '17 at 03:40
  • Excellent article. I know it is true. Arduino programming becomes a workflow you don't want to interrupt. And for real C++ programs on the desktop machine there are so many things to remember, like pointers and variable names, not to mention the wayward curly braces and indentation; and for classes you have memorized, you have to go review them to remember all the methods involved or the properties in data structures. Yep, going to sleep means having to relearn your own code (and all the libraries you may have borrowed) when you wake up, wasting valuable programming time. ;-) – SDsolar May 03 '17 at 07:26
  • Here is a blast from when I was in CS school. Real Programmers Use Fortran. Modified because they were making us use Pascal: http://web.mit.edu/humor/Computers/real.programmers – SDsolar May 03 '17 at 07:28
  • Hi JkShaw, sorry to bother you again. I realize the code only crawls page 1 of the results instead of all of them. Is there anything I need to add to crawl the results on every page? – Sun May 04 '17 at 03:35
  • @Sun, I think you are getting results from page 1 only instead of all pages, in order to get results from all pages, you need to get the `next` page `url` and repeat the process you have already written. – JkShaw May 04 '17 at 04:33
  • @JkShaw I get your idea: basically I need to know the total number of results and divide by the per-page result limit. Any idea how to get the count of the last result? – Sun May 04 '17 at 05:25
  • Please refer to http://stackoverflow.com/questions/28597041/scraping-multiple-paginated-links-with-beautifulsoup-and-requests and http://stackoverflow.com/questions/31062435/how-can-i-loop-scraping-data-for-multiple-pages-in-a-website-using-python-and-be ; they will give you a pretty good idea of how to proceed. – JkShaw May 04 '17 at 06:32
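
The pagination approach discussed in the comments above can be sketched with Google's start offset parameter. This is an untested outline under two assumptions: that each page holds 10 results, and that the div.g selector from the answer still matches; Google may also throttle or block rapid automated requests.

```python
import time
from random import randint

import requests
from bs4 import BeautifulSoup

base_url = "http://www.google.com.sg/search?q=%22Stack+Overflow%22&tbm=nws&tbs=qdr:y"

def page_url(page, per_page=10):
    # Google exposes the result offset via the 'start' query parameter
    return base_url + "&start=" + str(page * per_page)

def crawl_pages(max_pages=3):
    for page in range(max_pages):
        soup = BeautifulSoup(requests.get(page_url(page)).text, "lxml")
        results = soup.findAll("div", {"class": "g"})
        if not results:        # an empty page means we have run out of results
            break
        for result_table in results:
            a_click = result_table.find("a")
            print(a_click.get("href"))
        time.sleep(randint(1, 3))  # pause politely between page requests
```

Stopping on the first empty page sidesteps the "count of the last result" question: the loop just walks forward until Google stops returning result blocks.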