python url extract from html

Question

I need a Python regex to extract URLs from HTML. Example HTML code:

<a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a>
<a href=""http://www.site.it/prodottiLLPP.php?id=1"" class=""txtBlueGeorgia16"">Prodotti</a>
<a href=""http://www.site.it/terremoto.php"" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>

I need to extract only:

 http://a0c5e.site.it/r
 http://www.site.it/prodottiLLPP.php?id=1
 http://www.site.it/terremoto.php
 http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse

Welcome to Stack Overflow! It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. — Martijn Pieters, Dec 16 '12 at 17:46
1. See @MartijnPieters' answer. 2. [**Don't use a regex**](http://stackoverflow.com/a/1732454/1248554) for parsing html! — BrtH, Dec 16 '12 at 17:51
[A fast way to extract all ANCHORs from HTML in python](http://stackoverflow.com/q/13126600) — Anonymous Coward, Dec 16 '12 at 17:55

Abhijit · Answer 1 · 2012-12-16T17:57:14.433

2

A regex might solve your problem, but consider using BeautifulSoup instead:

>>> html = """<a href="http://a0c5e.site.it/r" target=_blank><font color=#808080>MailUp</font></a>
<a href="http://www.site.it/prodottiLLPP.php?id=1" class=""txtBlueGeorgia16"">Prodotti</a>
<a href="http://www.site.it/terremoto.php" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>`"""
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> [e['href'] for e in soup.findAll('a')]
[u'http://a0c5e.site.it/r', u'http://www.site.it/prodottiLLPP.php?id=1', u'http://www.site.it/terremoto.php', u'http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse']

From Jon Clements

soup.findAll('a', {'href': True})

On a different note, the href quoting in your HTML snippet is incorrect.
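The session above uses the old BeautifulSoup 3 package on Python 2. In BeautifulSoup 4 the import is `from bs4 import BeautifulSoup` and `findAll` is spelled `find_all`. If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's `html.parser`. This is only a minimal sketch: the names `HrefCollector` and `extract_hrefs` are made up for illustration, and the sample input is a shortened version of the question's HTML with the quoting corrected.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href, like the {'href': True} filter.
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values found in <a> tags, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical sample: two of the question's links with the quoting
# corrected, plus one anchor that has no href at all.
sample = """<a href="http://a0c5e.site.it/r" target=_blank>MailUp</a>
<a name="top">no href here</a>
<a class='mini' href='http://www.site.it/terremoto.php'>Terremoto</a>"""

print(extract_hrefs(sample))
```

The anchor without an href is skipped, which is the same robustness that the `{'href': True}` filter gives in BeautifulSoup.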

edited Dec 16 '12 at 17:57

answered Dec 16 '12 at 17:51


Good answer but it's kind of spoon feeding. Now he will be able to copy and paste the code and will come up with another question instead of reading the documents :) – Eren T. Dec 16 '12 at 17:53
@ErenT.: I am not sure he was aware of BeautifulSoup. It is more to convince him of the power of bs and to get him to think beyond regex. – Abhijit Dec 16 '12 at 17:54
`soup.findAll('a', {'href': True})` is a bit more robust :) – Jon Clements Dec 16 '12 at 17:55
I think part of the point is that he wants to work with wrongly-quoted stuff. Sometimes you can't control whether people feed you bad data. – Ishpeck Dec 16 '12 at 18:02

Ishpeck · Accepted Answer · 2012-12-16T17:59:27.940

Observe

Python 2.7.3 (default, Sep  4 2012, 20:19:03) 
[GCC 4.2.1 20070831 patched [FreeBSD]] on freebsd9
Type "help", "copyright", "credits" or "license" for more information.
>>> junk=''' <a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a>
... <a href=""http://www.site.it/prodottiLLPP.php?id=1"" class=""txtBlueGeorgia16"">Prodotti</a>
... <a href=""http://www.site.it/terremoto.php"" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
... <a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>`'''
>>> import re
>>> pat=re.compile(r'''http[\:/a-zA-Z0-9\.\?\=&]*''')
>>> pat.findall(junk)
['http://a0c5e.site.it/r', 'http://www.site.it/prodottiLLPP.php?id=1', 'http://www.site.it/terremoto.php', 'http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse']

You might want to add % to the character class so that it also catches percent-escaped characters.
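The character class above also stops at characters such as `-`, `_`, `~` and `#`, which are common in URLs. One possible variation, offered only as a sketch and not as the answer's original method, is to anchor on `href=` and capture everything up to the next quote, whitespace or `>`. The name `HREF_RE` is made up here, and the pattern assumes you only care about URLs that appear as href values.

```python
import re

# Assumption: only href values are wanted. The pattern tolerates doubled
# or missing quote characters around the value, as in the question's data.
HREF_RE = re.compile(r"""href\s*=\s*["']*([^"'>\s]+)""", re.IGNORECASE)

junk = ''' <a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a>
<a href=""http://www.site.it/prodottiLLPP.php?id=1"" class=""txtBlueGeorgia16"">Prodotti</a>
<a href=""http://www.site.it/terremoto.php"" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>'''

# findall returns only the captured group, i.e. the URL itself.
urls = HREF_RE.findall(junk)
for url in urls:
    print(url)
```

Because the capture stops at quotes and whitespace instead of listing allowed characters, `%`, `-` and similar characters are picked up without extending the class. The trade-off is that it matches the text `href=` anywhere in the input, including outside `<a>` tags.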

*Applauds* Thank you master Ishpeck for showing us the ancient martial art of simple regex. (^u^) — Devyn Collier Johnson, Aug 18 '13 at 00:58

score 0 · Answer 3 · answered Dec 16 '12 at 17:50

You can use the BeautifulSoup library to manipulate HTML and extract information from it.

I don't recommend using regular expressions to parse HTML. HTML is not a regular language; its grammar is context-free. When the structure of a link changes, the HTML can still be valid while your regex no longer matches, and you will have to rewrite the expression. Using BeautifulSoup is a decent way to extract information.
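To make the fragility concrete, here is a small sketch comparing a rigid regex with a parser on three links that a browser would treat the same way. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The sample HTML, the `example.com` URLs and the names `RIGID_RE`, `LinkParser`, `links_by_regex` and `links_by_parser` are all invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the same kind of link written three different ways.
SAMPLE = """<a href="http://example.com/plain">one</a>
<A CLASS="nav" HREF='http://example.com/upper'>two</A>
<a
   title="t" href="http://example.com/multiline">three</a>"""

# A rigid pattern that assumes one exact layout of the tag.
RIGID_RE = re.compile(r'<a href="([^"]*)"')


class LinkParser(HTMLParser):
    """Collect href values from <a> tags, whatever their layout."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are lowercased by the parser.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def links_by_regex(html):
    return RIGID_RE.findall(html)


def links_by_parser(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


print(links_by_regex(SAMPLE))
print(links_by_parser(SAMPLE))
```

The regex finds only the first link, because the other two change case, quoting, attribute order or line breaks. The parser returns all three without any change to the code.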
