
The following code never returns a single non-empty `netloc` or `scheme` from `urlparse`; the scheme and netloc are prepended to the `path` component instead. What am I doing wrong?

#! /usr/bin/python
# -*- coding: UTF-8 -*-

from urllib import urlopen  
from urlparse import urlparse, urljoin 
import re   
link_exp = re.compile("href=(.+?)(?:'|\")", re.UNICODE)  

flux = urlopen("http://www.w3.org") 
links = [urlparse(x) for x in link_exp.findall(flux.read())]
for x in links : 
    print x

This extracts every URL (or maybe my regex is wrong?) and prints it, except that 'http://' always ends up in the path rather than in the scheme. How come? I should probably also reimplement the urlparse functionality once this is solved, since this is a course exercise, not a real-world scenario. Sorry for not being clearer on this!
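
For example, one of the printed results looks like this (the exact URL is just whatever the page happens to link to):

urlparse('"http://www.w3.org/standards/')
# ParseResult(scheme='', netloc='', path='"http://www.w3.org/standards/', params='', query='', fragment='')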

pouzzler
  • Regex, HTML, [bad idea](http://stackoverflow.com/a/1732454/398968) -- use [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). – Katriel Jan 31 '13 at 18:11
  • Not answering the question while trying to look cryptic: teenage idea. The code works, except that the netloc and scheme end up in the path. I don't think the regex should be blamed for this, but I'm willing to be proven wrong. – pouzzler Jan 31 '13 at 18:12
  • Won't `link_exp.findall()` find strings like `href="http://example.com`? – millimoose Jan 31 '13 at 18:13
  • What are you expecting to do with this? – ATOzTOA Jan 31 '13 at 18:13
  • @pouzzler Way to go alienating people who bothered to read your question. And actually provide good pointers. – millimoose Jan 31 '13 at 18:13
  • @pouzzler you're right, sorry. In full: you shouldn't use regular expressions to extract the links from an HTML page, because they aren't powerful enough to understand all the weird ways in which HTML can be valid. Instead, you should install and use a library designed to parse HTML, and extract the links from _that_ (see the sketch after these comments). You can then call `urlparse` on those links. – Katriel Jan 31 '13 at 18:14
  • I would like to know what is wrong. This is an exercise, therefore I should probably reimplement urlparse as well, and I don't see how to implement parsing without regex. Maybe my regex is wrong, but the answer wasn't an answer to my question. – pouzzler Jan 31 '13 at 18:15
  • To clarify, I'm asking a scholarly question. I am not interested in a real-world solution, but in an answer. All the best, guys. – pouzzler Jan 31 '13 at 18:15
  • And please excuse my temper; I've been doing entirely too much regex this afternoon, which is no excuse. Sorry again. – pouzzler Jan 31 '13 at 18:22
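
A minimal sketch of the library-based approach Katriel describes in the comments, assuming BeautifulSoup 4 is installed (this swaps the question's regex for a real HTML parser):

from urllib import urlopen
from urlparse import urlparse
from bs4 import BeautifulSoup

# Parse the page, pull the href attribute from every <a> tag that has one,
# and only then hand the clean URLs to urlparse.
soup = BeautifulSoup(urlopen("http://www.w3.org").read())
links = [urlparse(a["href"]) for a in soup.find_all("a", href=True)]
for link in links:
    print link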

2 Answers


Your regex is wrong:

x = "<a href='http://www.bbcnews.com'>foo</a>"
link_exp.findall(x)
# ["'http://www.bbcnews.com"]

Note that you're including the opening quote in the capture. Because the string handed to `urlparse` then starts with `'` rather than `http`, no valid scheme can be recognised (quote characters aren't allowed in scheme names), and the string doesn't start with `//` either, so the whole thing lands in the `path` component while `scheme` and `netloc` stay empty.
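
If you want to keep the regex for the exercise, one small fix (just a sketch reusing the question's imports, not a general HTML parser) is to make the pattern consume the opening quote, single or double, so the capture starts at the scheme:

link_exp = re.compile(r"""href=["'](.+?)["']""", re.UNICODE)

# Same example as above, but the quote is no longer part of the capture:
x = "<a href='http://www.bbcnews.com'>foo</a>"
link_exp.findall(x)
# ['http://www.bbcnews.com']
urlparse(link_exp.findall(x)[0])
# ParseResult(scheme='http', netloc='www.bbcnews.com', path='', params='', query='', fragment='')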

Katriel
  • FYI the way to debug this sort of thing is to separate out all the nested function calls of `[urlparse(x) for x in link_exp.findall(flux.read())]` and step through with a debugger, looking at each in turn. – Katriel Jan 31 '13 at 18:27

Use this:

link_exp = re.compile(r"href=\"(.+?)(?:'|\")", re.UNICODE)  

Output:

...
ParseResult(scheme='http', netloc='ev.buaa.edu.cn', path='/', params='', query='', fragment='')
...
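
This pattern only matches double-quoted `href` attributes, and the non-greedy capture stops at the first `'` or `"` it meets, so single-quoted links are skipped and an apostrophe inside a URL would cut the match short. A variant that accepts either quote style (again only a sketch for the exercise, not a real HTML parser) captures the opening quote and backreferences it:

from urllib import urlopen
from urlparse import urlparse
import re

# Capture whichever quote opens the attribute and require the same quote to close it.
link_exp = re.compile(r"""href=(["'])(.+?)\1""", re.UNICODE)

html = urlopen("http://www.w3.org").read()
# findall returns (quote, url) tuples because of the extra group,
# so the comprehension unpacks both and keeps only the URL.
links = [urlparse(url) for _, url in link_exp.findall(html)]
for link in links:
    print link
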
ATOzTOA