
The function get("href") is not returning the full link. The HTML file contains the link:

[screenshot of the HTML source showing the full href]

But link.get("href") returns:

"navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO"

sub_site = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"

response = urllib.request.urlopen(sub_site)

data = response.read()

soup = BeautifulSoup(data,'lxml')
for link in soup.find_all('a'):

    url = link.get("href")
    print (url)  
  • I don't see any similar link on the [page](https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim) that you are trying to scrape. – gimme_danger Jun 02 '19 at 09:38

2 Answers


Let me focus on the specific part of your problem in the HTML:

<a class='warp_lightbox' title='Comprar' href='//www.fotoregistro.com.br/
navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'><img src='
//sh.digipix.com.br/subhomes/_lojas_consumer/paginas/fotolivro/img/180slim/vitrine/classic_01_tb.jpg' alt='slim' />
                              </a>

You can get it by doing:

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href")
    break

You will find that url is:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'

You can see two important patterns at the beginning of the string:

  • //, which makes the URL protocol-relative: it reuses the protocol (scheme) of the page it was found on (see the sketch after this list);
  • \r, which is the ASCII Carriage Return (CR) character.
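
As a quick illustration of the first point, here is a sketch using Python's standard urllib.parse; the base URL is the sub_site from the question, and the href is shortened for readability:

from urllib.parse import urljoin

base_page = 'https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim'
# a protocol-relative href inherits the scheme of the page it came from
print(urljoin(base_page, '//www.fotoregistro.com.br/navhome.php?lightbox'))
# https://www.fotoregistro.com.br/navhome.php?lightbox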

When you print it, the carriage return sends the cursor back to the start of the line, so the terminal output simply loses this part:

//www.fotoregistro.com.br/\r
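
A minimal sketch of what happens, using a made-up string rather than the real href:

s = '//www.example.com/\rnavhome.php?x=1'  # hypothetical value containing '\r'
print(s)  # the carriage return jumps back to column 0, so the part before
          # '\r' is visually overwritten; the string itself is still intact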

If you need the raw string, you can use repr in your for loop:

print(repr(url))

and you get:

'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'

If you need the path, you can replace the initial part:

base = 'www.fotoregistro.com.br/'

for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href").replace('//www.fotoregistro.com.br/\r',base)
    print(url)

and you get:

www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/preview=true/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
...

Without specifying the class:

for link in soup.find_all('a'):
    url = link.get("href")
    print(repr(url))
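
If you also want each href cleaned up and made absolute without knowing any class names, one possible sketch (reusing sub_site and soup from the question) is to drop stray control characters and resolve every link against the page URL with urljoin:

from urllib.parse import urljoin

for link in soup.find_all('a', href=True):
    raw = link['href']
    # remove stray control characters such as the '\r' seen above
    cleaned = ''.join(ch for ch in raw if ch not in '\r\n\t').strip()
    # urljoin resolves protocol-relative ('//...') and relative hrefs
    print(urljoin(sub_site, cleaned))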
  • I'm looking for a generic solution where, without knowing the classes, I can collect all available links. – m4rc3l Jun 02 '19 at 21:54

Use select and it seems to print fine:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO')
soup = bs(r.content, 'lxml')
print([item['href'] for item in soup.select('.warp_lightbox')])

Use

print([item['href'] for item in soup.select('[href]')])

for all links.
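
If those hrefs turn out to be protocol-relative as well, a small sketch reusing r and soup from the snippet above resolves them against the URL that was actually fetched:

from urllib.parse import urljoin

# resolve each href against the final request URL (r.url)
print([urljoin(r.url, item['href']) for item in soup.select('[href]')])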
