Names from a webpage are not getting scraped

Question

When I run my scraper, it fetches nothing from yell.com. As far as I can tell the XPaths are correct, and I can't find any mistake in the code. Is there a fix or workaround? This is the code I tried:

```
import requests
from lxml import html

url="https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=all+states&scrambleSeed=821749505"
def Startpoint(address):
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[contains(@class,"col-sm-24")]')
    for title in titles:
        try:
            Name=title.xpath('.//h2[@itemprop="name"]/text()')[0]
            print(Name)
        except exception as e:
            print(e.message)
            continue
Startpoint(url)
```

score 1 · Accepted Answer · edited May 23 '17 at 12:26

You need to specify a User-Agent string pretending to be a real browser:

```
response = requests.get(address, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'})
```
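If the scraper makes more than one request, the header can be set once on a session instead of being passed to every call. This is only a minimal sketch: the `build_session` helper and the `BROWSER_UA` constant are illustrative names, not part of the original code. Nothing is sent over the network here; the request is only prepared so the merged headers can be inspected.

```python
import requests

# Browser-like User-Agent string (same one as above); any current browser string should do.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/58.0.3029.81 Safari/537.36"
)


def build_session(user_agent=BROWSER_UA):
    """Hypothetical helper: return a session that sends `user_agent` with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = build_session()
# Prepare (but do not send) a request to confirm the header is attached.
prepared = session.prepare_request(requests.Request("GET", "https://www.yell.com/"))
print(prepared.headers["User-Agent"])
```

With such a session, `session.get(address)` would take the place of `requests.get(address)` in `Startpoint`.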

Some other notes:

`Exception` starts with an upper-case letter, so the handler should read `except Exception as e:`.

You should not use the `col-sm-24` class in your locator. A Bootstrap class like this is layout-specific and says nothing about what data the container holds. Use the `businessCapsule` class instead:
```
titles = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")
```
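Note how the class attribute is checked here: padding `@class` with spaces and searching for ` businessCapsule ` matches the class only as a whole token, not as a substring of some other class name.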

You can use the `findtext()` method to get the result titles:

```
results = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")

for result in results:
    name = result.findtext('.//h2[@itemprop="name"]')
    print(name)
```
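The difference between a substring check and a whole-token check on `@class`, and the way `findtext()` behaves when the heading is missing, can be tried offline. The markup below is made up for illustration and will not match the real yell.com page; only the class name `businessCapsule` and the `itemprop="name"` heading come from the answer above.

```python
from lxml import html

# Made-up markup for illustration; the real page structure will differ.
SAMPLE = """
<html><body>
<div class="row businessCapsule">
  <h2 itemprop="name">Pizza One</h2>
</div>
<div class="businessCapsule--classification">
  <span>not a result</span>
</div>
<div class="businessCapsule col-sm-24">
  <p>result without a name heading</p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Substring test: also matches "businessCapsule--classification".
loose = tree.xpath('//div[contains(@class, "businessCapsule")]')

# Token test: matches only elements whose class list holds the exact token.
strict = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")

names = []
for result in strict:
    # findtext() returns None when no matching element exists.
    name = result.findtext('.//h2[@itemprop="name"]')
    if name is not None:
        names.append(name.strip())

print(len(loose), len(strict))
print(names)
```

Checking for `None` before using the result avoids the need for a broad `try`/`except` around the lookup.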

answered May 03 '17 at 06:12

alecxe

You are just awesome, sir alecxe. Every time you touch on my messy code it works like magic. Gonna accept it in a while. – SIM May 03 '17 at 06:20
Findtext method is quite new to me. It works perfectly. Is this "findtext" method applicable for extracting "href" as well? – SIM May 03 '17 at 07:18
@SMth80 nope, `findtext()` is only about the text of a node. Thanks. – alecxe May 03 '17 at 09:37
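
For the `href` question, two common lxml options are selecting the attribute in XPath or reading it from the element with `.get()`. A minimal sketch on made-up markup; the `<a>` element, its class and the example.com URL are assumptions for illustration, not taken from yell.com.

```python
from lxml import html

# Made-up markup for illustration; the link element and URL are hypothetical.
SAMPLE = """
<html><body>
<div class="businessCapsule">
  <h2 itemprop="name">Pizza One</h2>
  <a class="website" href="https://example.com/pizza-one">Website</a>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
capsule = tree.xpath("//div[contains(concat(' ', @class, ' '), ' businessCapsule ')]")[0]

# Option 1: select the attribute directly in XPath (returns a list of strings).
hrefs = capsule.xpath('.//a/@href')

# Option 2: find the element first, then read the attribute with .get().
link = capsule.find('.//a')
href = link.get('href') if link is not None else None

print(hrefs, href)
```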
