1

Apologies if there is a duplicate, I searched but couldn't find an answer. I was writing a scraper to scrape a default directory index page served by my webserver. The html looks like this

<html>
<head><title>Index of /Mysongs</title></head>
<body bgcolor="white">
<h1>Index of /Mysongs</h1><hr><pre><a href="../">../</a>
<a href="Mysong1.mkv">Mysong1.mp3</a>                        10-May-2016 07:24           183019
<a href="Mysong2.mkv">Mysong2.ogg</a>                        10-May-2016 07:27           177205

The href link looks like a text only, and not a url (<a href="Mysong2.mkv">), but on pointing to the text, it shows the link in the browser's status bar (http://127.0.0.1/Mysongs/Mysong2.ogg)

I tried to extract the url using beautifulsoup, like this

#!/usr/bin/python

import httplib2
import sys
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request(sys.argv[1])
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    print link.get('href')

and I am not able to get the link like http://127.0.0.1/Mysongs/Mysong2.ogg, but only <a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24

Should I be using the sys.argv[1] to construct the href link like

print sys.argv[1] + link.get('href')

or is there some better way to get this?

Edit:: Current output is

Mysong1.mp3
Mysong2.ogg

Expected output:

http://127.0.0.1/Mysong1.mp3
http://127.0.0.1/Mysong1.0gg
init
  • 585
  • 5
  • 11
  • What exactly do you want as output? – Ani Menon Jun 04 '16 at 12:56
  • Thank you @AniMenon, I was intending to pass these urls to an external download accelerator that downloads chunks consecutively. So I was looking for urls instead of just plain text. I could create the url by concatenating the base url and the text, but I was wondering if this is the only way, and there is a pythonic way or a module support. – init Jun 04 '16 at 13:11
  • post expected output and current output. – Ani Menon Jun 04 '16 at 13:13
  • Done, @AniMenon. Thank you. – init Jun 04 '16 at 13:17
  • @init Beautiful soup always returns whatever is in the href, so if you want the output like that then adding the base url is your only option. – Ani Menon Jun 04 '16 at 13:21
  • Thank you @AniMenon, if you post it as an answer, I will accept it. I was unsure whether there are any other methods that the module has, that could be used. – init Jun 04 '16 at 13:23

1 Answers1

1

Yes your only option is to add the base url. But don't add it this way:

print sys.argv[1] + link.get('href')

Use this:

from urlparse import urljoin
urljoin('http://something.com/random/abc.html', '../../music/MySong.mp3')

In your method, the relative paths may not be identified & handled, urljoin handles it.

Ani Menon
  • 27,209
  • 16
  • 105
  • 126