I am trying to fetch a URL from a webpage. Here is how the URL looks in the browser's Inspect section (screenshot not included).

In my Python code, the same URL comes out as a relative path starting with ../../ (screenshot not included).

How can I get the actual URL without the ../../ part using BeautifulSoup? Here is my code in case it's needed:

import re
import requests
from bs4 import BeautifulSoup

source = requests.get('https://books.toscrape.com/catalogue/category/books_1/index.html').text
soup = BeautifulSoup(source, 'lxml')

# article = soup.find('article')
# title = article.div.a.img['alt']
# print(title['alt'])


titles, topics, urls, sources = [], [], [], []
article_productPod = soup.find_all('article', class_='product_pod')
for i in article_productPod:
    titles.append(i.div.a.img['alt'])
# print(titles)
for q in article_productPod:
    urls.append(q.h3.a['href'])  # href is relative to the page URL
print(urls[0])
# for z in range(len(urls)):
    # source2 = requests.get("https://" + urls[z])

Ethan Brown
  • Does this answer your question? [Scrape the absolute URL instead of a relative path in python](https://stackoverflow.com/questions/44001007/scrape-the-absolute-url-instead-of-a-relative-path-in-python) – buran May 13 '22 at 17:03
  • The URL is saved in the `urls` array, the output in the photo is printed by the `print(urls[0])` line – Ethan Brown May 13 '22 at 17:04

1 Answer

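Use `urljoin` from the standard library's urllib.parse module:

import urllib.parse  # import the submodule explicitly; a bare `import urllib` is not guaranteed to load it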

Store your target URL in a separate variable:

src_url = r'https://books.toscrape.com/catalogue/category/books_1/index.html'
source = requests.get(src_url).text

Join the website's URL and the relative URL:

for q in article_productPod:
    urls.append(urllib.parse.urljoin(src_url, q.h3.a['href']))
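
To see what the join does without hitting the network, here is a minimal sketch using only the standard library. The values in `sample_hrefs` and the helper name `to_absolute` are assumptions made up for illustration; they stand in for whatever `q.h3.a['href']` returns on the real page.

```python
from urllib.parse import urljoin

# Page the links were scraped from (same URL as in the question).
BASE_URL = "https://books.toscrape.com/catalogue/category/books_1/index.html"

# Hypothetical href values, standing in for q.h3.a['href'].
sample_hrefs = [
    "../../a-light-in-the-attic_1000/index.html",  # climbs two directories
    "page-2.html",                                 # sibling of the base page
    "/index.html",                                 # rooted at the site root
    "https://example.com/other.html",              # already absolute
]


def to_absolute(base_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(base_url, href) for href in hrefs]


absolute_urls = to_absolute(BASE_URL, sample_hrefs)
for url in absolute_urls:
    print(url)
```

Each `../` removes one directory from the base path, so the first entry resolves to https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html. An href that is already absolute is returned unchanged, so it is safe to pass every link through `urljoin`.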