
The function get("href") is not returning the full link. The HTML file contains the link:

[screenshot of the HTML source showing the full href]

But link.get("href") returns:

"navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO"

sub_site = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"

response = urllib.request.urlopen(sub_site)

data = response.read()

soup = BeautifulSoup(data,'lxml')
for link in soup.find_all('a'):

    url = link.get("href")
    print (url)  
  • I don't see any similar link on the [page](https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim) that you are trying to scrape. – gimme_danger Jun 02 '19 at 09:38

2 Answers


Let me focus on the specific part of your problem in the HTML:

<a class='warp_lightbox' title='Comprar' href='//www.fotoregistro.com.br/
navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'><img src='
//sh.digipix.com.br/subhomes/_lojas_consumer/paginas/fotolivro/img/180slim/vitrine/classic_01_tb.jpg' alt='slim' />
                              </a>

You can get it by doing:

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href")
    break

You will find that url is:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'

You can see two important patterns at the beginning of the string:

  • //, which makes the URL protocol-relative: it reuses the protocol (scheme) of the page it was found on (see the sketch after this list);
  • \r, which is the ASCII Carriage Return (CR) character.
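
As a quick illustration of the first point, here is a sketch using Python's standard urllib.parse; the base URL is the sub_site from the question, and the href is shortened for readability:

from urllib.parse import urljoin

base_page = 'https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim'
# a protocol-relative href inherits the scheme of the page it came from
print(urljoin(base_page, '//www.fotoregistro.com.br/navhome.php?lightbox'))
# https://www.fotoregistro.com.br/navhome.php?lightbox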

When you print it, the carriage return sends the cursor back to the start of the line, so the terminal output simply loses this part:

//www.fotoregistro.com.br/\r
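
A minimal sketch of what happens, using a made-up string rather than the real href:

s = '//www.example.com/\rnavhome.php?x=1'  # hypothetical value containing '\r'
print(s)  # the carriage return jumps back to column 0, so the part before
          # '\r' is visually overwritten; the string itself is still intact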

If you need the raw string, you can use repr in your for loop:

print(repr(url))

and you get:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'

If you need the path, you can replace the initial part:

base = 'www.fotoregistro.com.br/'

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href").replace('//www.fotoregistro.com.br/\r',base)
    print(url)

and you get:

www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/preview=true/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
...

Without specifying the class:

for link in soup.find_all('a'):
    url = link.get("href")
    print(repr(url))
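
If you also want each href cleaned up and made absolute without knowing any class names, one possible sketch (reusing sub_site and soup from the question) is to drop stray control characters and resolve every link against the page URL with urljoin:

from urllib.parse import urljoin

for link in soup.find_all('a', href=True):
    raw = link['href']
    # remove stray control characters such as the '\r' seen above
    cleaned = ''.join(ch for ch in raw if ch not in '\r\n\t').strip()
    # urljoin resolves protocol-relative ('//...') and relative hrefs
    print(urljoin(sub_site, cleaned))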
  • I'm looking for a generic solution where, without knowing the classes, I can collect all available links. – m4rc3l Jun 02 '19 at 21:54

Use select and it seems to print fine:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO')
soup = bs(r.content, 'lxml')
print([item['href'] for item in soup.select('.warp_lightbox')])

Use

print([item['href'] for item in soup.select('[href]')])

for all links.
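
If those hrefs turn out to be protocol-relative as well, a small sketch reusing r and soup from the snippet above resolves them against the URL that was actually fetched:

from urllib.parse import urljoin

# resolve each href against the final request URL (r.url)
print([urljoin(r.url, item['href']) for item in soup.select('[href]')])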
