-2

I need a Python regex to extract URLs from HTML. Example HTML code:

<a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a>
<a href=""http://www.site.it/prodottiLLPP.php?id=1"" class=""txtBlueGeorgia16"">Prodotti</a>
<a href=""http://www.site.it/terremoto.php"" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>

I need to extract only:

 http://a0c5e.site.it/r
 http://www.site.it/prodottiLLPP.php?id=1
 http://www.site.it/terremoto.php
 http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse
Chris Seymour
AutoSoft
  • Welcome to Stack Overflow! It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. – Martijn Pieters Dec 16 '12 at 17:46
  • Did you actually mean to have doubled quotes in the href field? – Sushant Gupta Dec 16 '12 at 17:47
  • 1. See @MartijnPieters' answer. 2. [**Don't use a regex**](http://stackoverflow.com/a/1732454/1248554) for parsing html! – BrtH Dec 16 '12 at 17:51
  • [A fast way to extract all ANCHORs from HTML in python](http://stackoverflow.com/q/13126600) – Anonymous Coward Dec 16 '12 at 17:55

3 Answers

2

A regex might solve your problem, but consider using BeautifulSoup instead:

>>> html = """<a href="http://a0c5e.site.it/r" target=_blank><font color=#808080>MailUp</font></a>
<a href="http://www.site.it/prodottiLLPP.php?id=1" class=""txtBlueGeorgia16"">Prodotti</a>
<a href="http://www.site.it/terremoto.php" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>`"""
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> [e['href'] for e in soup.findAll('a')]
[u'http://a0c5e.site.it/r', u'http://www.site.it/prodottiLLPP.php?id=1', u'http://www.site.it/terremoto.php', u'http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse']

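As Jon Clements suggests in the comments, filtering on the href attribute is a bit more robust, because it skips anchors that have no href:

soup.findAll('a', {'href': True}) 

On a different note, the href quotation in your HTML snippet is incorrect.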

Abhijit
  • Good answer, but it's kind of spoon-feeding. Now he will be able to copy and paste the code and will come up with another question instead of reading the documentation :) – Eren T. Dec 16 '12 at 17:53
  • @ErenT.: I am not sure he was aware of BeautifulSoup. This is more to convince him of the power of BeautifulSoup and to get him to think beyond regex. – Abhijit Dec 16 '12 at 17:54
  • `soup.findAll('a', {'href': True})` is a bit more robust :) – Jon Clements Dec 16 '12 at 17:55
  • I think part of the point is that he wants to work with wrongly-quoted stuff. Sometimes you can't control whether people feed you bad data. – Ishpeck Dec 16 '12 at 18:02
1

Observe

Python 2.7.3 (default, Sep  4 2012, 20:19:03) 
[GCC 4.2.1 20070831 patched [FreeBSD]] on freebsd9
Type "help", "copyright", "credits" or "license" for more information.
>>> junk=''' <a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a>
... <a href=""http://www.site.it/prodottiLLPP.php?id=1"" class=""txtBlueGeorgia16"">Prodotti</a>
... <a href=""http://www.site.it/terremoto.php"" target=""blank"" class=""txtGrigioScuroGeorgia12"">Terremoto</a>
... <a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>`'''
>>> import re
>>> pat=re.compile(r'''http[\:/a-zA-Z0-9\.\?\=&]*''')
>>> pat.findall(junk)
['http://a0c5e.site.it/r', 'http://www.site.it/prodottiLLPP.php?id=1', 'http://www.site.it/terremoto.php', 'http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse']

You might want to add % to the character class so that it also catches percent-encoded escapes.
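
To illustrate that suggestion, here is a minimal sketch of the same idea with a wider character class. The name `extract_urls`, the extra characters (`%`, `-`, `_`, `~`, `#`, `+`) and the third sample link are assumptions added for illustration and are not part of the session above; the pattern is still only an approximation of URL syntax.

```python
import re

# Same approach as the pattern above, with a wider character class.
# Inside [...] the characters : / . ? = & need no backslash escapes,
# and a trailing "-" is taken literally.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9:/.?=&%_~#+-]+")


def extract_urls(text):
    """Return every substring of text that looks like an http(s) URL."""
    return URL_PATTERN.findall(text)


# Two links from the question (including the doubled quotes) plus one
# hypothetical link containing a percent-encoded space.
sample = '''<a href=""http://a0c5e.site.it/r"" target=_blank>MailUp</a>
<a class='mini' href='http://www.site.com/remove/professionisti.aspx?Id=65&Code=xhmyskwzse'>clicca qui.</a>
<a href="http://www.site.it/my%20page.php?id=1">Encoded</a>'''

for url in extract_urls(sample):
    print(url)
```

Because the match simply stops at the first character outside the class, it does not care whether the value is wrapped in single, double, or doubled quotes. The flip side is that it will also pick up URLs that appear outside of href attributes.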

Ishpeck
0

You can use the BeautifulSoup library to manipulate HTML or extract information from it.

I don't recommend using regular expressions to parse HTML. HTML is not a regular language; its grammar is context-free. When the structure of a link changes, the HTML can still be valid while your regex no longer matches, and you will have to rewrite the expression. Using BeautifulSoup is a decent way to extract the information.
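
As a rough illustration of the parser-based approach, here is a minimal sketch. Note that it swaps in `html.parser` from the Python 3 standard library in place of BeautifulSoup, only so that it runs without installing anything; the names `HrefCollector` and `extract_hrefs` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all anchors in html, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Attribute order and quote style differ between the links, and the
# last anchor has no href at all.
sample = (
    '<a href="http://www.site.it/terremoto.php" target="blank">Terremoto</a>\n'
    "<a class='mini' href='http://www.site.com/remove/professionisti.aspx"
    "?Id=65&Code=xhmyskwzse'>clicca qui.</a>\n"
    '<a name="top">no link here</a>'
)

print(extract_hrefs(sample))
```

The parser handles attribute order and quoting on its own, so the extraction code does not need to change when the markup around the link does. This sketch only covers well-formed input; how a given parser treats badly quoted attributes, such as the doubled quotes in the question, is something to check against your actual data.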

Eren T.