
The following code never returns a single non-empty `netloc` or `scheme` from `urlparse`; the scheme and netloc are prepended to the `path` component instead. What am I doing wrong?

#! /usr/bin/python
# -*- coding: UTF-8 -*-

from urllib import urlopen  
from urlparse import urlparse, urljoin 
import re   
link_exp = re.compile("href=(.+?)(?:'|\")", re.UNICODE)  

flux = urlopen("http://www.w3.org") 
links = [urlparse(x) for x in link_exp.findall(flux.read())]
for x in links : 
    print x

This extracts every URL (or maybe my regex is wrong?) and prints it, except that 'http://' always ends up in the path rather than in the scheme. How come? I should probably also reimplement the urlparse functionality once this is solved, since this is a course exercise, not a real-world scenario. Sorry for not being clearer on this!
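
For example, one of the printed results looks like this (the exact URL is just whatever the page happens to link to):

urlparse('"http://www.w3.org/standards/')
# ParseResult(scheme='', netloc='', path='"http://www.w3.org/standards/', params='', query='', fragment='')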

pouzzler
  • Regex, HTML, [bad idea](http://stackoverflow.com/a/1732454/398968) -- use [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). – Katriel Jan 31 '13 at 18:11
  • Not answering the question while trying to look cryptic: teenage idea. The code works, except that the netloc and scheme end up in the path. I don't think the regex should be blamed for this, but I'm willing to be proven wrong. – pouzzler Jan 31 '13 at 18:12
  • Won't `link_exp.findall()` find strings like `href="http://example.com`? – millimoose Jan 31 '13 at 18:13
  • What are you expecting to do with this? – ATOzTOA Jan 31 '13 at 18:13
  • @pouzzler Way to go alienating people who bothered to read your question. And actually provide good pointers. – millimoose Jan 31 '13 at 18:13
  • @pouzzler you're right, sorry. In full: you shouldn't use regular expressions to extract the links from an HTML page, because they aren't powerful enough to understand all the weird ways in which HTML can be valid. Instead, you should install and use a library designed to parse HTML, and extract the links from _that_ (see the sketch after these comments). You can then call `urlparse` on those links. – Katriel Jan 31 '13 at 18:14
  • I would like to know what is wrong. This is an exercise, therefore I should probably reimplement urlparse as well, and I don't see how to implement parsing without regex. Maybe my regex is wrong, but the answer wasn't an answer to my question. – pouzzler Jan 31 '13 at 18:15
  • To clarify, I'm asking a scholarly question. I am not interested in a real-world solution, but in an answer. All the best, guys. – pouzzler Jan 31 '13 at 18:15
  • And please excuse my temper; I've been doing entirely too much regex this afternoon, which is no excuse. Sorry again. – pouzzler Jan 31 '13 at 18:22
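
A minimal sketch of the library-based approach Katriel describes in the comments, assuming BeautifulSoup 4 is installed (this swaps the question's regex for a real HTML parser):

from urllib import urlopen
from urlparse import urlparse
from bs4 import BeautifulSoup

# Parse the page, pull the href attribute from every <a> tag that has one,
# and only then hand the clean URLs to urlparse.
soup = BeautifulSoup(urlopen("http://www.w3.org").read())
links = [urlparse(a["href"]) for a in soup.find_all("a", href=True)]
for link in links:
    print link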

2 Answers


Your regex is wrong:

x = "<a href='http://www.bbcnews.com'>foo</a>"
link_exp.findall(x)
# ["'http://www.bbcnews.com"]

Note that you're including the opening quote in the capture. Because the string handed to `urlparse` then starts with `'` rather than `http`, no valid scheme can be recognised (quote characters aren't allowed in scheme names), and the string doesn't start with `//` either, so the whole thing lands in the `path` component while `scheme` and `netloc` stay empty.
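
If you want to keep the regex for the exercise, one small fix (just a sketch reusing the question's imports, not a general HTML parser) is to make the pattern consume the opening quote, single or double, so the capture starts at the scheme:

link_exp = re.compile(r"""href=["'](.+?)["']""", re.UNICODE)

# Same example as above, but the quote is no longer part of the capture:
x = "<a href='http://www.bbcnews.com'>foo</a>"
link_exp.findall(x)
# ['http://www.bbcnews.com']
urlparse(link_exp.findall(x)[0])
# ParseResult(scheme='http', netloc='www.bbcnews.com', path='', params='', query='', fragment='')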

Katriel
  • FYI the way to debug this sort of thing is to separate out all the nested function calls of `[urlparse(x) for x in link_exp.findall(flux.read())]` and step through with a debugger, looking at each in turn. – Katriel Jan 31 '13 at 18:27

Use this:

link_exp = re.compile(r"href=\"(.+?)(?:'|\")", re.UNICODE)  

Output:

...
ParseResult(scheme='http', netloc='ev.buaa.edu.cn', path='/', params='', query='', fragment='')
...
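
This pattern only matches double-quoted `href` attributes, and the non-greedy capture stops at the first `'` or `"` it meets, so single-quoted links are skipped and an apostrophe inside a URL would cut the match short. A variant that accepts either quote style (again only a sketch for the exercise, not a real HTML parser) captures the opening quote and backreferences it:

from urllib import urlopen
from urlparse import urlparse
import re

# Capture whichever quote opens the attribute and require the same quote to close it.
link_exp = re.compile(r"""href=(["'])(.+?)\1""", re.UNICODE)

html = urlopen("http://www.w3.org").read()
# findall returns (quote, url) tuples because of the extra group,
# so the comprehension unpacks both and keeps only the URL.
links = [urlparse(url) for _, url in link_exp.findall(html)]
for link in links:
    print link
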
ATOzTOA