I'm trying to get all the hrefs from an HTML page and store them in a list for future processing, such as this:

Example URL: www.example-page-xl.com

 <body>
    <section>
    <a href="/helloworld/index.php"> Hello World </a>
    </section>
 </body>

I'm using the following code to list the hrefs:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.example-page-xl.com').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

for url in section.find_all('a'):
    print(url.get('href'))

However, I would like to store the URL as www.example-page-xl.com/helloworld/index.php and not just the relative path, which is /helloworld/index.php.

Appending/joining the URL with the relative path isn't what I want, since the dynamic links may vary when I join the URL and the relative path myself.

In a nutshell, I would like to scrape the absolute URLs and not the relative paths alone (and without joining them myself).

Cœur

4 Answers


urllib.parse.urljoin() might help. It does a join, but it is smart about it: it handles both relative and absolute paths. Note this is Python 3 code.

>>> import urllib.parse
>>> base = 'https://www.example-page-xl.com'

>>> urllib.parse.urljoin(base, '/helloworld/index.php') 
'https://www.example-page-xl.com/helloworld/index.php'

>>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
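For completeness, a couple of other cases are worth knowing about (the URLs below are just illustrative): a relative reference without a leading slash resolves against the base's directory, and a scheme-relative `//host/...` reference keeps the base's scheme.

```python
import urllib.parse

base = 'https://www.example-page-xl.com/docs/index.html'

# Relative reference (no leading slash): resolved against the base's directory
print(urllib.parse.urljoin(base, 'about.html'))
# https://www.example-page-xl.com/docs/about.html

# Scheme-relative reference (//host/...): keeps the base's scheme
print(urllib.parse.urljoin(base, '//cdn.example-page-xl.com/app.js'))
# https://cdn.example-page-xl.com/app.js
```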
Andrei Cioara
    I think the best use case is missing: `urllib.parse.urljoin('https://example.com/subsection/', '/but-was-in-an-a-href')` equals `https://example.com/but-was-in-an-a-href` – aliqandil Sep 03 '19 at 21:24

In this case urllib.parse.urljoin helps you. You should modify your code like this:

import bs4 as bs
import urllib.request
from urllib.parse import urljoin

web_url = 'https://www.example-page-xl.com'
sauce = urllib.request.urlopen(web_url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

section = soup.section

for url in section.find_all('a'):
    print(urljoin(web_url, url.get('href')))

Here urljoin handles both absolute and relative paths.

Vivek Jain
Somil
    You have the contents of the file in hand, it's silly to ignore the possibility that there is a `<base>` element: https://www.w3.org/TR/html52/infrastructure.html#parsing-urls – rakslice Jul 11 '19 at 01:02
  • Is there any reason to prefer the `urlparse` package over the standard library's `urllib.parse`? – Ben Price Sep 27 '19 at 17:50
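As the comment about `<base>` notes, a page can declare its own base URL in the document head. A minimal stdlib-only sketch of honoring it (the HTML and URLs are made up for illustration; since `<base>` must appear in `<head>`, it is always seen before any `<a>`):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkResolver(HTMLParser):
    """Collect absolute hrefs, honoring a <base href> if the page declares one."""
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'base' and attrs.get('href'):
            # <base> overrides the page URL for all later references
            self.base = attrs['href']
        elif tag == 'a' and attrs.get('href'):
            self.links.append(urljoin(self.base, attrs['href']))

html = ('<html><head><base href="https://cdn.example-page-xl.com/assets/"></head>'
        '<body><a href="img/logo.png">logo</a></body></html>')
parser = LinkResolver('https://www.example-page-xl.com')
parser.feed(html)
print(parser.links)
# ['https://cdn.example-page-xl.com/assets/img/logo.png']
```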

I think another option is to go with the _replace method of urllib.parse.urlparse. Most of the time the base URL will change, so instead of declaring it with a fixed value, I take the URL from the source and change its path.

from urllib.parse import urlparse

old_link = "https://www.example-page-xl.com/old-path"

new_link = urlparse(old_link)._replace(path="/new-path").geturl()
# "https://www.example-page-xl.com/new-path"

Here is the structure of a URL: scheme://netloc/path;parameters?query#fragment. See the urllib.parse documentation for details.
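To see those components concretely, urlparse splits a URL (a made-up example) like this:

```python
from urllib.parse import urlparse

parts = urlparse("https://www.example-page-xl.com/path;params?q=1#frag")
print(parts.scheme)    # https
print(parts.netloc)    # www.example-page-xl.com
print(parts.path)      # /path
print(parts.params)    # params
print(parts.query)     # q=1
print(parts.fragment)  # frag
```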

Julia Meshcheryakova

I find the following solution to be the most robust.

import urllib.parse

def base_url(url, with_path=False):
    parsed = urllib.parse.urlparse(url)
    # Optionally keep the directory part of the path, dropping the last segment
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    # Strip everything after the (possibly truncated) path
    parsed = parsed._replace(path=path, params='', query='', fragment='')
    return parsed.geturl()
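As a quick sanity check of the idea behind that function, the same stripping can be done inline with a single _replace call (the URL is a made-up example):

```python
from urllib.parse import urlparse

url = "https://www.example-page-xl.com/a/b/c.html?q=1#top"

# Dropping path, params, query and fragment leaves just scheme://netloc
stripped = urlparse(url)._replace(path='', params='', query='', fragment='').geturl()
print(stripped)
# https://www.example-page-xl.com
```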
Rami Alloush