python beautifulsoup no link when parsing 'a' tag and href

Question

Apologies if this is a duplicate; I searched but couldn't find an answer. I am writing a scraper for a default directory index page served by my web server. The HTML looks like this:

<html>
<head><title>Index of /Mysongs</title></head>
<body bgcolor="white">
<h1>Index of /Mysongs</h1><hr><pre><a href="../">../</a>
<a href="Mysong1.mkv">Mysong1.mp3</a>                        10-May-2016 07:24           183019
<a href="Mysong2.mkv">Mysong2.ogg</a>                        10-May-2016 07:27           177205

The href value looks like plain text rather than a URL (<a href="Mysong2.mkv">), but hovering over the link shows the full URL in the browser's status bar (http://127.0.0.1/Mysongs/Mysong2.ogg).

I tried to extract the URL using BeautifulSoup, like this:

#!/usr/bin/python

import httplib2
import sys
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request(sys.argv[1])
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    print link.get('href')

This does not give me a full link such as http://127.0.0.1/Mysongs/Mysong2.ogg, only what is in the tag itself: <a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24

Should I use sys.argv[1] to construct the full link, like this:

print sys.argv[1] + link.get('href')

Or is there a better way to get this?

Edit: Current output is:

Mysong1.mp3
Mysong2.ogg

Expected output:

http://127.0.0.1/Mysong1.mp3
http://127.0.0.1/Mysong2.ogg

Thank you @AniMenon, I intend to pass these URLs to an external download accelerator that downloads chunks consecutively, so I was looking for URLs instead of plain text. I could create the URL by concatenating the base URL and the text, but I was wondering whether this is the only way, or whether there is a more pythonic way or module support. — init, Jun 04 '16 at 13:11
@init Beautiful soup always returns whatever is in the href, so if you want the output like that then adding the base url is your only option. — Ani Menon, Jun 04 '16 at 13:21
Thank you @AniMenon, if you post it as an answer, I will accept it. I was unsure whether there are any other methods that the module has, that could be used. — init, Jun 04 '16 at 13:23

score 1 · Accepted Answer · answered Jun 04 '16 at 13:26

Yes, your only option is to add the base URL. But don't add it this way:

print sys.argv[1] + link.get('href')

Use this:

from urlparse import urljoin
urljoin('http://something.com/random/abc.html', '../../music/MySong.mp3')

With plain string concatenation, relative paths (such as ../) may not be recognized and handled; urljoin resolves them against the base URL.
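
To illustrate the difference, here is a minimal sketch, assuming Python 3 (where urljoin lives in urllib.parse rather than urlparse). The base URL and the href values are hypothetical stand-ins for sys.argv[1] and the values returned by link.get('href'); they are not taken from a real server.

```python
# Python 3: urljoin moved from the urlparse module to urllib.parse.
from urllib.parse import urljoin

# Assumed base URL of the directory index; note the trailing slash.
base = "http://127.0.0.1/Mysongs/"

# Hypothetical href values, standing in for link.get('href') results.
hrefs = ["../", "Mysong1.mp3", "Mysong2.ogg"]

# urljoin resolves each relative reference against the base URL.
absolute = [urljoin(base, href) for href in hrefs]
for url in absolute:
    print(url)

# Without the trailing slash, the last path segment of the base is replaced.
no_slash = urljoin("http://127.0.0.1/Mysongs", "Mysong1.mp3")

# Plain concatenation does not insert a separator or resolve "../".
concatenated = "http://127.0.0.1/Mysongs" + "Mysong1.mp3"
```

One thing to watch for: urljoin treats a base without a trailing slash as pointing at a file, so it swaps out the last segment ("Mysongs") instead of appending to it. Passing the directory URL with its trailing slash avoids that.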
