Apologies if this is a duplicate; I searched but couldn't find an answer. I am writing a scraper for a default directory index page served by my web server. The HTML looks like this:
<html>
<head><title>Index of /Mysongs</title></head>
<body bgcolor="white">
<h1>Index of /Mysongs</h1><hr><pre><a href="../">../</a>
<a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24 183019
<a href="Mysong2.mkv">Mysong2.ogg</a> 10-May-2016 07:27 177205
The href link looks like plain text rather than a URL (<a href="Mysong2.mkv">), but hovering over the text shows the full link in the browser's status bar (http://127.0.0.1/Mysongs/Mysong2.ogg).
I tried to extract the URL using BeautifulSoup, like this:
#!/usr/bin/python
import httplib2
import sys
from BeautifulSoup import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request(sys.argv[1])
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    print link.get('href')
With this I am not able to get a full link like http://127.0.0.1/Mysongs/Mysong2.ogg, but only what is in <a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24.
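To see exactly which value is being picked up, it can help to print the href attribute and the link text side by side. Below is a minimal sketch that uses only the standard library's HTML parser instead of BeautifulSoup (a deliberate swap so it runs without extra packages). The class name `LinkCollector` and the `sample` string are hypothetical, modelled on the index page above.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collect [href, text] pairs for every <a> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []       # one [href, text] entry per anchor
        self._in_a = False    # True while inside an <a>...</a> pair

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append([dict(attrs).get("href"), ""])
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        # Only text between <a> and </a> is attached to the current link.
        if self._in_a:
            self.links[-1][1] += data


# Sample markup modelled on the directory index (an assumption, not real server output).
sample = ('<pre><a href="../">../</a>\n'
          '<a href="Mysong1.mkv">Mysong1.mp3</a> 10-May-2016 07:24 183019\n'
          '</pre>')

collector = LinkCollector()
collector.feed(sample)
for href, text in collector.links:
    print("href=%s text=%s" % (href, text))
```

In both cases the href is relative to the page, so neither value on its own is the full URL shown in the status bar.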
Should I be using sys.argv[1] to construct the href link, like
print sys.argv[1] + link.get('href')
or is there some better way to get this?
Edit: Current output is
Mysong1.mp3
Mysong2.ogg
Expected output:
http://127.0.0.1/Mysong1.mp3
http://127.0.0.1/Mysong2.ogg
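On the question of building the full link from sys.argv[1]: one alternative to plain string concatenation is `urljoin` from the standard library, which resolves a relative href against the page URL in the same way a browser does. This is only a sketch under assumptions: the helper name `absolute_links` is hypothetical, and the base URL and href values are taken from the example page above rather than from a real request.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolute_links(base_url, hrefs):
    """Resolve each relative href against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


# Assumed inputs: the page URL that would be passed as sys.argv[1],
# and the href values the scraper extracted from the index page.
base = "http://127.0.0.1/Mysongs/"
links = absolute_links(base, ["../", "Mysong1.mkv", "Mysong2.mkv"])
for link in links:
    print(link)
```

Note that the trailing slash on the base URL matters: with "http://127.0.0.1/Mysongs" (no slash), `urljoin` treats "Mysongs" as a file name and replaces it, while plain concatenation would produce "Mysongshref" run together. Either way, the result depends on the href value, not on the visible link text.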