The following code never gives me a non-empty netloc or scheme from urlparse. Instead, the scheme and netloc end up at the start of the path component. What am I doing wrong, please?
#! /usr/bin/python
# -*- coding: UTF-8 -*-
from urllib import urlopen
from urlparse import urlparse, urljoin
import re
link_exp = re.compile("href=(.+?)(?:'|\")", re.UNICODE)
flux = urlopen("http://www.w3.org")
links = [urlparse(x) for x in link_exp.findall(flux.read())]
for x in links:
    print x
This extracts every URL (or maybe my regex is wrong?) and prints it, except that 'http://' always ends up in the path rather than in the scheme. How come? I should probably reimplement the urlparse functionality once I have solved this, since this is a course exercise, not a real-world scenario. Sorry for not being clearer on this!
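
For what it's worth, here is a minimal offline sketch of what might be going on. It assumes a made-up HTML snippet (sample) instead of the real page. The capture group in the pattern above starts right after href=, so it appears to include the opening quote, and urlparse does not recognize a scheme in a string that begins with a quote. The second pattern (quoted_exp) is only an illustrative assumption, not a robust way to parse HTML.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hypothetical HTML sample, not taken from www.w3.org.
sample = '<a href="http://www.w3.org/standards/">Standards</a>'

# Pattern from the question: the group begins immediately after "href=",
# so the opening quote becomes part of the captured text.
link_exp = re.compile("href=(.+?)(?:'|\")", re.UNICODE)
bad = link_exp.findall(sample)[0]

# Illustrative alternative (an assumption): consume the opening quote
# outside the group, so only the URL itself is captured.
quoted_exp = re.compile("href=[\"'](.+?)[\"']", re.UNICODE)
good = quoted_exp.findall(sample)[0]

# Compare how urlparse splits each captured string.
for label, url in (("question's regex", bad), ("alternative regex", good)):
    parts = urlparse(url)
    print("%s: %r" % (label, url))
    print("  scheme=%r netloc=%r path=%r" % (parts.scheme, parts.netloc, parts.path))
```

If the first result shows an empty scheme and netloc while the second does not, the leading quote in the captured string is the likely culprit rather than urlparse itself.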