What's the easiest way to extract the links on a web page using Python without BeautifulSoup?

Question

I'm using Cygwin and do not have BeautifulSoup installed.

How about installing BeautifulSoup then? Might be the easiest way :) – Sven Marnach, Dec 11 '10 at 00:11
Possibly. I just saw something in my search results suggesting it might be difficult on Cygwin, possibly more difficult than doing it without BeautifulSoup. – jonderry, Dec 11 '10 at 00:14
Actually, I just installed it pretty easily. It's good to know the other ways though. – jonderry, Dec 11 '10 at 00:46

score 1 · Accepted Answer · edited May 23 '17 at 12:26

Community

answered Dec 11 '10 at 00:34

nate c

score 0 · Answer 2 · answered Dec 11 '10 at 00:43

If you don't care much about performance, you can use regular expressions:

import re
linkre = re.compile(r"""href=["']([^"']+)["']""")
links = linkre.findall(your_html)

If you only want absolute http:// links, change the expression to the following (the group includes the http: prefix so the full URL is captured):

linkre = re.compile(r"""href=["'](http:[^"']+)["']""")

You can also make the quotes optional, in case the HTML has href values without quotes around them.
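As a rough sketch of the optional-quote variant, something like the following should work. The `sample_html` string and the name `optional_quote_re` are made up for illustration, and the pattern stops at whitespace or `>` so that unquoted values do not swallow the rest of the tag:

```python
import re

# Hypothetical sample input, not taken from the original answer.
sample_html = (
    '<a href="http://a.example/x">A</a> '
    "<a href='/rel'>B</a> "
    "<a href=http://b.example/y>C</a>"
)

# The opening quote is optional; the value runs until a quote,
# whitespace, or the closing '>' of the tag.
optional_quote_re = re.compile(r"""href=["']?([^"'\s>]+)""")

links = optional_quote_re.findall(sample_html)
print(links)
```

This still matches any text that merely looks like `href=...`, including text outside of tags, so it is a quick approximation rather than real HTML parsing.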

Piotr Lopusiewicz

Regular expressions would likely actually be faster than doing proper HTML parsing, so I don't think this is a matter of performance but rather correctness. – Liquid_Fire Dec 11 '10 at 01:56
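For comparison, here is a minimal sketch of what "proper HTML parsing" could look like without BeautifulSoup, using only the standard library's `html.parser` module. It assumes Python 3 (in Python 2 the module is named `HTMLParser`); the class name `LinkCollector`, the helper `extract_links`, and the sample string are hypothetical names chosen for this example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample: the first href is plain text, not an attribute.
sample = (
    '<p>href="not-a-link"</p>'
    "<A HREF=docs.html>Docs</A>"
    '<a href="http://example.com/">Home</a>'
)
found = extract_links(sample)
print(found)
```

Unlike the regular expression, this skips `href=` text that appears outside a tag and handles upper-case or unquoted attributes, which is the correctness difference the comment points at.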