Python/BeautifulSoup: Retrieving 'href' attribute

Question

I am trying to get the href attribute from a website I am scraping. My script:

from bs4 import BeautifulSoup
import requests
import csv


for page in range(1, 2):
   baseurl = "https://www.quandoo.nl/amsterdam?page=" + str(page)
   r1 = requests.get(baseurl)
   data = r1.text
   soup = BeautifulSoup(data, "html.parser")
   # attrs is a dict mapping attribute names to required values
   for link in soup.findAll('span', {'class': "merchant-title", 'itemprop': "name"}):
       print(link)

Returns the following:

<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/ristorante-due-napoletani-5644" itemprop="url">Ristorante Due Napoletani</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/yamyam-4850" itemprop="url">YamYam</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/the-golden-temple-5278" itemprop="url">The Golden Temple</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/sampurna-4609" itemprop="url">Sampurna</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/motto-sushi-25471" itemprop="url">Motto Sushi</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/takumi-ya-8171" itemprop="url">Takumi-Ya</a></span>
<span class="merchant-title" itemprop="name"><a href="https://www.quandoo.nl/place/casa-di-david-19167" itemprop="url">Casa di David</a></span>

(This is only part of the output; I didn't want to bombard you with all of it.) I have no trouble pulling out the string with the restaurant's name, but I can't find a way to get just the href attribute, and the .strip() method doesn't seem feasible with my current setup. Any help would be great.

I'm sorry, I am a bit confused. Which variable should I try converting with str()? – wavey Nov 22 '16 at 16:29
If you were using the code from the link, you should try `print str(a['href'])`. – zipa Nov 22 '16 at 16:40
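
The `a['href']` lookup mentioned in the comment above can be tried on one of the spans from the question's output. This is only a minimal sketch: the markup is copied from that output, and the names `span`, `a`, `href` and `safe_href` are illustrative, not part of the original script.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question.
html = (
    '<span class="merchant-title" itemprop="name">'
    '<a href="https://www.quandoo.nl/place/yamyam-4850" itemprop="url">YamYam</a>'
    '</span>'
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", {"class": "merchant-title"})

a = span.find("a")          # the <a> tag nested inside the span
href = a["href"]            # dictionary-style access; raises KeyError if absent
safe_href = a.get("href")   # returns None instead of raising if absent

print(href)
```

Using `a.get("href")` is probably the safer choice in a scraping loop, since a span without a link would otherwise stop the script with a `KeyError`.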

zipa · Accepted Answer · 2016-11-23T09:16:48.427


Try this code; it works for me:

from bs4 import BeautifulSoup
import requests
import csv

import re


for page in range(1, 2):
   baseurl = "https://www.quandoo.nl/amsterdam?page=" + str(page)
   r1 = requests.get(baseurl)
   data = r1.text
   soup = BeautifulSoup(data, "html.parser")
   # attrs is a dict mapping attribute names to required values
   for link in soup.findAll('span', {'class': "merchant-title", 'itemprop': "name"}):
       # group(1) is the captured URL only; group(0) would also include the leading href="
       match = re.search(r'href=[\'"]?([^\'" >]+)', str(link)).group(1)
       print(match)
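
The regular expression in the code above wraps the URL in a capture group, so the group number passed to `.group()` decides what comes back. A small sketch on one tag from the question's output shows the difference; the names `tag`, `whole` and `url` are illustrative only.

```python
import re

# One <a> tag copied from the output shown in the question.
tag = '<a href="https://www.quandoo.nl/place/takumi-ya-8171" itemprop="url">Takumi-Ya</a>'

match = re.search(r'href=[\'"]?([^\'" >]+)', tag)

if match is not None:
    whole = match.group(0)  # entire match, including the leading href="
    url = match.group(1)    # only the text captured by the parentheses
    print(whole)
    print(url)
```

Here `group(0)` yields `href="https://www.quandoo.nl/place/takumi-ya-8171` (with the attribute name and opening quote), while `group(1)` yields the bare URL.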

edited Nov 23 '16 at 09:16

answered Nov 22 '16 at 17:40

zipa


Thank you! I tried this configuration earlier; however, I am trying to isolate the restaurant links on the page. Those are the only ones I need to scrape further. Any ideas on how to isolate the hrefs to just the restaurants on the page? – wavey Nov 22 '16 at 18:04
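
One possible way to keep only the restaurant links is a CSS selector that matches `<a>` tags inside the `merchant-title` spans and ignores every other link. The sketch below assumes the page structure matches the output shown in the question; the "Next page" link is a hypothetical stand-in for unrelated links on the page, and `restaurant_links` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Two spans copied from the question's output, plus one hypothetical
# non-restaurant link (the "Next page" anchor) that should be skipped.
html = (
    '<a href="https://www.quandoo.nl/amsterdam?page=2">Next page</a>'
    '<span class="merchant-title" itemprop="name">'
    '<a href="https://www.quandoo.nl/place/sampurna-4609" itemprop="url">Sampurna</a>'
    '</span>'
    '<span class="merchant-title" itemprop="name">'
    '<a href="https://www.quandoo.nl/place/motto-sushi-25471" itemprop="url">Motto Sushi</a>'
    '</span>'
)

soup = BeautifulSoup(html, "html.parser")

# Select only <a> tags that carry an href and sit inside a merchant-title span.
restaurant_links = [a["href"] for a in soup.select("span.merchant-title a[href]")]

for url in restaurant_links:
    print(url)
```

The resulting list can then be fed back into `requests.get` to scrape each restaurant page in turn.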
