Using Beautiful Soup to get the full URL in source code

Question

I was looking at a page's source code and came across this bit of code:

<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg"

In the browser's source view the link is blue, and clicking it takes you to the full URL where the picture is located. I know how to get what is shown in the source code in Python using Beautiful Soup. What I want to know is how to get the full URL that you reach after clicking the link in the source view.

EDIT: If I am given <a href = "/folder/big/a.jpg", how do I figure out the starting part of that URL with Python or Beautiful Soup?

poke · Answer 1 · 2013-08-01T16:29:30.033

34

<a href="/folder/big/a.jpg">

That’s an absolute path on the current host. So if the HTML file is at http://example.com/foo/bar.html, then applying the URL /folder/big/a.jpg will result in this:

http://example.com/folder/big/a.jpg

That is, keep the scheme and host name and replace the path with the new one.

Python's standard library has the urljoin function to perform this operation for you:

>>> from urllib.parse import urljoin
>>> base = 'http://example.com/foo/bar.html'
>>> href = '/folder/big/a.jpg'
>>> urljoin(base, href)
'http://example.com/folder/big/a.jpg'

For Python 2, the function is within the urlparse module.
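
As a rough illustration of how urljoin treats different kinds of href values, the sketch below resolves a few links against the same base. Every href other than /folder/big/a.jpg, and the hosts other.example and cdn.example.com, are made up for illustration and do not come from the question.

```python
from urllib.parse import urljoin

base = 'http://example.com/foo/bar.html'

# Hypothetical href values, chosen only to illustrate the different cases
hrefs = [
    '/folder/big/a.jpg',            # leading slash: replaces the whole path
    'a.jpg',                        # no slash: relative to the base's directory
    '../a.jpg',                     # parent of the base's directory
    'https://other.example/a.jpg',  # already absolute: returned unchanged
    '//cdn.example.com/a.jpg',      # scheme-relative: keeps only the base's scheme
]

resolved = [urljoin(base, href) for href in hrefs]
for href, full in zip(hrefs, resolved):
    print(href, '->', full)
```

Only the first case matches the href in the question; the others are there to show that the same call covers them without any special handling.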

edited Aug 01 '13 at 16:29

answered Aug 01 '13 at 16:24

poke


(For joining the host and relative/absolute URL, see: http://stackoverflow.com/questions/8223939/how-to-join-absolute-and-relative-urls). – David Cain Aug 01 '13 at 16:25
@user2476540 Then the URL specified in the `a` tag is wrong. What I explained above is how the browser behaves when seeing a relative URL with a leading slash. – poke Aug 01 '13 at 18:02

score 0 · Answer 2 · answered Oct 11 '19 at 05:43

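from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

r = requests.get("http://example.com")

base_url = r.url  # final URL of the page (after redirects); this is the base url
data = r.content  # this is content of page
soup = BeautifulSoup(data, 'lxml')  # the 'lxml' parser requires the lxml package
temp_url = soup.find('a')['href']  # you need to modify this selector

# urljoin returns temp_url unchanged if it already starts with http:// or https://,
# and otherwise resolves it against base_url (plain string concatenation
# would produce a wrong URL whenever base_url contains a path)
url = urljoin(base_url, temp_url)

print(url)  # this is your full url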
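
For the image tag in the question, the same idea could be applied to every image on a page. This is a minimal sketch, not a definitive implementation: the helper name absolute_image_urls, the sample HTML, and the page address are all hypothetical, and it uses the html.parser backend so that only bs4 needs to be installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_image_urls(html, page_url):
    """Return the absolute URL of every <img> tag in html that has a src."""
    soup = BeautifulSoup(html, 'html.parser')
    # src=True skips <img> tags that have no src attribute
    images = soup.find_all('img', src=True)
    return [urljoin(page_url, img['src']) for img in images]


# Hypothetical page content and address, for illustration only
sample_html = '''
<img src="/gallery/a.jpg">
<img src="thumbs/b.jpg">
<img src="http://example.org/c.jpg">
'''
page_url = 'http://example.com/foo/bar.html'

image_urls = absolute_image_urls(sample_html, page_url)
for image_url in image_urls:
    print(image_url)
```

When the page is fetched with requests, r.url is a reasonable value for page_url. The sketch ignores a <base href="..."> tag, which would change the base that a browser uses.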

score 0 · Answer 3 · answered Mar 14 '22 at 10:54

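from urllib.parse import urljoin


current_url = 'https://example.com/b/c.html?a=1&b=2'
href = '/folder/big/a.jpg'
# os.path.dirname(current_url) would keep the /b directory and give
# https://example.com/b/folder/big/a.jpg; urljoin resolves the href as a browser does
absolute_url = urljoin(current_url, href)
print(absolute_url)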
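
The edit to the question asks how to find the "starting part" of the URL. One possible sketch, assuming that means the scheme and host of the page the link was found on, uses urlsplit from the standard library. The variable name origin is chosen here for illustration.

```python
from urllib.parse import urlsplit

current_url = 'https://example.com/b/c.html?a=1&b=2'
href = '/folder/big/a.jpg'

parts = urlsplit(current_url)
# Scheme plus host: everything before the path
origin = f'{parts.scheme}://{parts.netloc}'

# Plain concatenation is valid only because href starts with a slash;
# for other kinds of href, urljoin(current_url, href) is the safer choice
absolute_url = origin + href

print(origin)
print(absolute_url)
```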

eccstartup

