
When I run my scraper, it fetches nothing from yell.com. As far as I can tell the XPaths are correct, and I can't find a mistake in the code. I hope there is a workaround. This is the code I tried:

import requests
from lxml import html

url="https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=all+states&scrambleSeed=821749505"
def Startpoint(address):
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[contains(@class,"col-sm-24")]')
    for title in titles:
        try:
            Name=title.xpath('.//h2[@itemprop="name"]/text()')[0]
            print(Name)
        except exception as e:
            print(e.message)
            continue
Startpoint(url)
SIM

1 Answer

You need to specify a User-Agent string pretending to be a real browser:

response = requests.get(address, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'})
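
To check whether the missing header is really the cause, it helps to fail on an HTTP error instead of silently parsing an error page. The following is a minimal sketch, not a definitive fix: `make_session` and `fetch_html` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the User-Agent value is simply the one from the line above.

```python
import requests

# Assumption: the browser-like User-Agent from the answer; any current browser string should do.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, address, timeout=10):
    """Fetch a page and raise on a 4xx/5xx response rather than returning its body."""
    response = session.get(address, timeout=timeout)
    response.raise_for_status()
    return response.text
```

With these helpers, `fetch_html(make_session(), url)` would either return the page source or raise `requests.HTTPError`, which makes a blocked request visible straight away.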

Some other notes:

  • `Exception` starts with an upper-case letter, so the handler should read `except Exception as e:`. In Python 3, print `e` itself, since exceptions no longer have a `message` attribute.
  • You should not use the `col-sm-24` class in your locator: a Bootstrap class like this describes layout and says nothing about what data the container holds. Use the `businessCapsule` class instead:

    titles = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")
    

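    Note how we properly check the class attribute here: padding it with spaces and searching for `' businessCapsule '` matches that class name as a whole token, not a longer class name that merely contains it.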

  • You can use the `findtext()` method to get the result titles:

    results = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")
    
    for result in results:
        name = result.findtext('.//h2[@itemprop="name"]')
        print(name)
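
The difference between the two class checks can be seen on a small inline document. This is only a sketch: the markup in `SAMPLE` is made up for illustration and is not the real yell.com page structure.

```python
from lxml import html

# Hypothetical markup, invented for this example; the real page differs.
SAMPLE = """
<div>
  <div class="row businessCapsule"><h2 itemprop="name">Pizza One</h2></div>
  <div class="businessCapsule--footer"><h2 itemprop="name">Not a listing</h2></div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Substring check: also matches class names that merely contain "businessCapsule".
loose = tree.xpath('//div[contains(@class, "businessCapsule")]')

# Token check: pads the attribute with spaces so only the exact class name matches.
strict = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")

loose_names = [div.findtext('.//h2[@itemprop="name"]') for div in loose]
strict_names = [div.findtext('.//h2[@itemprop="name"]') for div in strict]

print(loose_names)
print(strict_names)
```

Here the substring check picks up both containers, while the padded check keeps only the one whose class list contains `businessCapsule` as a separate name.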
    
alecxe
  • You are just awesome, sir alecxe. Every time you touch on my messy code it works like magic. Gonna accept it in a while. – SIM May 03 '17 at 06:20
  • Findtext method is quite new to me. It works perfectly. Is this "findtext" method applicable for extracting "href" as well? – SIM May 03 '17 at 07:18
  • @SMth80 nope, `findtext()` is only about the text of a node. Thanks. – alecxe May 03 '17 at 09:37
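
For the `href` question in the comments, attribute values are usually read with `get()` on the element or with an `@href` step in XPath. A minimal sketch follows, assuming each result contains a link; the markup, the `website` class and the example.com URL are invented for illustration.

```python
from lxml import html

# Hypothetical result block; the class name and URL are placeholders.
SAMPLE = (
    '<div class="businessCapsule">'
    '<h2 itemprop="name">Pizza One</h2>'
    '<a class="website" href="https://example.com/pizza-one">Website</a>'
    '</div>'
)

result = html.fromstring(SAMPLE)

# findtext() returns the text of the first matching node.
name = result.findtext('.//h2[@itemprop="name"]')

# find() returns the element itself (or None); get() then reads one attribute.
link = result.find('.//a[@class="website"]')
href_via_get = link.get('href') if link is not None else None

# XPath can select the attribute directly and returns a list of strings.
hrefs_via_xpath = result.xpath('.//a[@class="website"]/@href')

print(name, href_via_get, hrefs_via_xpath)
```

`find()` plus `get()` mirrors the `findtext()` style from the answer, while the `@href` form is handy when a result may hold several links.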