13

I was looking at the source code of a web page and came across this bit of code:

<img src="/gallery/2012-winners-finalists/HM_Watching%20birds2_Shane%20Conklin_MA_2012.jpg"

In the browser's source view the link is blue, and clicking it takes you to the full URL where the picture is located. I know how to get what is shown in the source code in Python using Beautiful Soup, but how do I get the full URL that you reach after clicking the link?

EDIT: If I am given <a href = "/folder/big/a.jpg", how do I figure out the starting part of that URL using Python or Beautiful Soup?

poke
user2476540

3 Answers

34
<a href="/folder/big/a.jpg">

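That’s an absolute path on the current host: the leading slash means it is resolved from the root of the site, not from the directory of the current page. So if the HTML file is at http://example.com/foo/bar.html, then applying the URL /folder/big/a.jpg will result in this: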

http://example.com/folder/big/a.jpg

I.e. take the scheme and host name of the current page and apply the new path to them.

Python's standard library provides the urljoin function to perform this operation for you:

>>> from urllib.parse import urljoin
>>> base = 'http://example.com/foo/bar.html'
>>> href = '/folder/big/a.jpg'
>>> urljoin(base, href)
'http://example.com/folder/big/a.jpg'
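For comparison, here is a small sketch of how urljoin treats a few other kinds of href against the same base. The href values are made up for illustration and do not come from the question.

```python
from urllib.parse import urljoin

base = 'http://example.com/foo/bar.html'

# Hypothetical href values, chosen only to illustrate each case
hrefs = [
    '/folder/big/a.jpg',           # leading slash: resolved from the host root
    'big/a.jpg',                   # no leading slash: resolved from the page's directory
    '../a.jpg',                    # parent of the page's directory
    'http://other.example/a.jpg',  # already absolute: returned unchanged
    '//cdn.example.com/a.jpg',     # no scheme: inherits the scheme of the base
]

resolved = [urljoin(base, href) for href in hrefs]
for href, full in zip(hrefs, resolved):
    print(href, '->', full)
```

Only the first form ignores the /foo/ directory of the current page, which is why the leading slash matters.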

For Python 2, the function is within the urlparse module.
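Since the question mentions Beautiful Soup, the following is a minimal sketch of combining it with urljoin. The page URL, the markup, and the helper name absolute_links are assumptions made up for this example; in practice the markup would come from the downloaded page and the base would be that page's URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, standing in for a downloaded page
page_url = 'http://example.com/foo/bar.html'
html = '''
<a href="/folder/big/a.jpg">big</a>
<img src="/gallery/thumb.jpg">
<a name="no-href">anchor without href</a>
'''

soup = BeautifulSoup(html, 'html.parser')


def absolute_links(soup, base):
    """Return absolute URLs for every <a href> and <img src> in the soup."""
    urls = []
    for tag, attr in (('a', 'href'), ('img', 'src')):
        # attrs={attr: True} skips tags that lack the attribute
        for element in soup.find_all(tag, attrs={attr: True}):
            urls.append(urljoin(base, element[attr]))
    return urls


links = absolute_links(soup, page_url)
print(links)
```

The sketch uses the built-in html.parser so that no extra parser package is needed; any other parser supported by Beautiful Soup should work the same way here.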

poke
  • (For joining the host and relative/absolute URL, see: http://stackoverflow.com/questions/8223939/how-to-join-absolute-and-relative-urls). – David Cain Aug 01 '13 at 16:25
  • @user2476540 Then the URL specified in the `a` tag is wrong. What I explained above is how the browser behaves when seeing a relative URL with a leading slash. – poke Aug 01 '13 at 18:02
0
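from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

r = requests.get("http://example.com")

base_url = r.url  # this is base url (after any redirects)
data = r.content  # this is content of page
soup = BeautifulSoup(data, 'lxml')  # the 'lxml' parser needs the lxml package installed
temp_url = soup.find('a')['href']  # you need to modify this selector

if temp_url.startswith(("http://", "https://")):  # url is already absolute
    url = temp_url
else:
    # resolve against the base url instead of concatenating strings,
    # which would break for pages outside the site root
    url = urljoin(base_url, temp_url)

print(url)  # this is your full url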
Biplob Das
0
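from urllib.parse import urlsplit

current_url = 'https://example.com/b/c.html?a=1&b=2'
href = '/folder/big/a.jpg'
# A leading slash means the path starts at the site root, so keep only the
# scheme and host of the current URL (os.path.dirname would keep '/b' as well)
parts = urlsplit(current_url)
absolute_url = '{}://{}{}'.format(parts.scheme, parts.netloc, href)
print(absolute_url)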
eccstartup