
I have been working on this problem for the last 10 hours and I am still unable to solve it. The code works for some people, but it is not working for me.

The main purpose is to extract the result URLs from every page of Google results for https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0

And here is my code:

# -*- coding: utf-8
from bs4 import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0".format (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/43.0.1'})
    urlfile = urllib2.urlopen(request)
    html = urlfile.read()
    soup = BeautifulSoup(html)
    linkdictionary = {}

    for li in soup.findAll('div', attrs={'class' : 'g'}): # It never goes inside this for loop as find.All results Null

        sLink = li.find('.r a')
        print sLink['href']

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')
    print links

I am getting {} as a result. The call soup.findAll('div', attrs={'class' : 'g'}) returns an empty list, so I am unable to scrape any results.

I am using BS4 and Python 2.7. Please help me understand why the code is not working properly. Any help would be much appreciated.

Further, it would be great if someone could give some insight into why the same code works for some people and not for others (this happened to me last time as well). Thanks.

  • Well, one problem I see straight away is that you're trying to put a query into your `address` string using `.format()` but there are no placeholders in your string to tell Python where to put the query. – kindall Dec 26 '16 at 17:20
  • @kindall Even removing it doesn't work. Have you run the same code on your computer? Does it work? – Muhammad Irfan Ali Dec 26 '16 at 17:52
  • It is better to use a search API for this (or Selenium). These could help: http://stackoverflow.com/questions/4082966/what-are-the-alternatives-now-that-the-google-web-search-api-has-been-deprecated/11206266#11206266 and https://github.com/scraperwiki/google-search-python – nguaman Dec 26 '16 at 18:09
  • @wu4m4n Thanks for your response. I will look into it. It looks a bit complicated because I have never worked with APIs before. Can you please explain why the Python code is unable to scrape the data? Is it because of some restriction from Google? – Muhammad Irfan Ali Dec 27 '16 at 04:11
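
On the `.format()` point raised in the comments, a minimal sketch of how the address could be built with real placeholders. It also shows one likely reason the plain HTTP request finds nothing: everything after `#` in a URL is a fragment, which the client keeps to itself and never sends to the server, so the question's request only asks for `/webhp` with no query. The helper name `build_search_url`, the `/search` path, and the parameter layout are assumptions for illustration, not something confirmed in this thread, and Google may still answer automated requests differently from a browser.

```python
try:
    from urllib.parse import quote_plus, urlsplit  # Python 3
except ImportError:
    from urllib import quote_plus                  # Python 2.7
    from urlparse import urlsplit

# Hypothetical template: the {placeholders} tell format() where each value goes.
SEARCH_TEMPLATE = (
    "https://www.google.com.au/search"
    "?q={query}&num={num}&gl=au&hl=en&start={start}"
)


def build_search_url(query, start=0, num=100):
    """Return a search URL with the query in the query string, not the fragment."""
    return SEARCH_TEMPLATE.format(query=quote_plus(query), num=num, start=start)


# The address from the question: the search terms live after '#'.
original = ("https://www.google.com.au/webhp?num=100&gl=au&hl=en"
            "#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0")
parts = urlsplit(original)
print(parts.query)     # the only part the server receives
print(parts.fragment)  # kept by the client, never sent

# One URL per page of results, stepping `start` by the page size.
for start in (0, 100, 200):
    print(build_search_url("site:focusonfurniture.com.au", start=start))
```

Stepping `start` in multiples of `num` is how the sketch walks through "all pages"; when to stop (for example, when a page yields no links) is left to the caller.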

1 Answer

This is an example of what you can do. You need Selenium and PhantomJS, which simulates a browser.

import re

import selenium.webdriver

url = 'https://www.google.com.au/webhp?num=100&gl=au&hl=en#q=site:focusonfurniture.com.au&gl=au&hl=en&start=0'

# PhantomJS is a headless browser, so the page's JavaScript runs before we read the HTML.
driver = selenium.webdriver.PhantomJS()
try:
    driver.get(url)
    html = driver.page_source
finally:
    driver.quit()  # always shut the browser down

# Each result shows its URL inside a <cite> tag; this path pattern matches only letters and slashes.
pattern = re.compile(r"<cite>(https://www\.focusonfurniture\.com\.au/[/A-Z]+)</cite>", re.IGNORECASE)

for link in pattern.findall(html):
    print(link)

Result:

https://www.focusonfurniture.com.au/delivery/
https://www.focusonfurniture.com.au/terms/
https://www.focusonfurniture.com.au/disclaimer/
https://www.focusonfurniture.com.au/dining/
https://www.focusonfurniture.com.au/bedroom/
https://www.focusonfurniture.com.au/catalogue/
https://www.focusonfurniture.com.au/mattresses/
https://www.focusonfurniture.com.au/clearance/
https://www.focusonfurniture.com.au/careers/
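
The extraction step above can be tried without a browser by running the regular expression against a saved or hand-written HTML string. The sketch below is one way to do that, under stated assumptions: `sample_html` is made-up markup that only mimics the `<cite>` tags, and `extract_cited_urls` is a hypothetical helper name. The path pattern is loosened to `[^<\s]*` so that paths containing digits or hyphens are also captured, which differs from the letters-and-slashes pattern in the answer.

```python
import re

# Capture everything up to the closing tag instead of only letters and slashes.
CITE_PATTERN = re.compile(
    r"<cite>(https://www\.focusonfurniture\.com\.au/[^<\s]*)</cite>",
    re.IGNORECASE,
)


def extract_cited_urls(html):
    """Return the cited URLs in page order, dropping repeats."""
    seen = set()
    urls = []
    for match in CITE_PATTERN.findall(html):
        if match not in seen:
            seen.add(match)
            urls.append(match)
    return urls


# Hypothetical markup standing in for driver.page_source.
sample_html = (
    "<div><cite>https://www.focusonfurniture.com.au/delivery/</cite></div>"
    "<div><cite>https://www.focusonfurniture.com.au/sofa-beds-2016/</cite></div>"
    "<div><cite>https://www.focusonfurniture.com.au/delivery/</cite></div>"
)

for link in extract_cited_urls(sample_html):
    print(link)
```

Keeping the extraction in its own function means the same code can be fed `driver.page_source` later; the real markup may differ from the sample, so the pattern should be checked against an actual page.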
nguaman