python, regex to find anchor link html

Question

I need a regex in Python to find a link's HTML within a larger block of HTML.

So if I have:

<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>

I would get back:

<a href="#" title="myurl">URL Text</a>

I'd like to do it with a regex and not BeautifulSoup or something similar. Does anyone have a snippet lying around that I could use for this?

Thanks

"I'd like to do it with a regex and not beautifulsoup or something similar to that." Enjoy pounding that screw with a hammer. — Ignacio Vazquez-Abrams, Jan 21 '10 at 02:57
Seriously: **DON'T** use regular expressions to parse HTML. Just don't. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Alex Martelli, Jan 21 '10 at 02:59
Why would you like to do it with a regex and not beautifulsoup or something similar to that? — SLaks, Jan 21 '10 at 03:02

mechanical_meat · Accepted Answer · 2010-01-21T04:01:16.007

Soup is good for you:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<ul class="something">
... <li id="li_id">
... <a href="#" title="myurl">URL Text</a>
... </li>
... </ul>''')

There are many arguments you can pass to the findAll method; more here. The one line below will get you started by returning a list of all links matching some conditions.

>>> soup.findAll(href='#', title='myurl')
[<a href="#" title="myurl">URL Text</a>]

Edit: based on OP's comment, added info included:

So let's say you're interested in only tags within list elements of a certain class <li class="li_class">. You could do something like this:

>>> soup = BeautifulSoup('''<li class="li_class">
...     <a href="#" title="myurl">URL Text</a>
...     <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
...     <a href="#" title="myurl3">URL Text3</a></li>''') # just some sample html

>>> for elem in soup.findAll("li", "li_class"):
...   pprint(elem.findAll('a')) # requires `from pprint import pprint`
... 
[<a href="#" title="myurl">URL Text</a>,
 <a href="#" title="myurl2">URL Text2</a>]
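
For comparison, the same scoping (only anchors inside list items of a given class) can be sketched with nothing but the standard library's html.parser, which avoids both a regex and a third-party package. This is a rough illustration, not part of the original answer: the class name AnchorsInListItems is made up for the example, and it assumes each link contains only plain text (nested tags inside the link are dropped, and entities come back unescaped).

```python
from html.parser import HTMLParser


class AnchorsInListItems(HTMLParser):
    """Collect <a>...</a> markup found inside <li> elements of one class.

    Hypothetical helper for illustration only.
    """

    def __init__(self, li_class):
        super().__init__(convert_charrefs=True)
        self.li_class = li_class
        self.li_stack = []   # one bool per open <li>: does it have the wanted class?
        self.current = None  # pieces of the <a> being collected, or None
        self.anchors = []    # finished anchors, as strings

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            classes = (dict(attrs).get("class") or "").split()
            self.li_stack.append(self.li_class in classes)
        elif tag == "a" and any(self.li_stack):
            # get_starttag_text() returns the start tag as written in the source
            self.current = [self.get_starttag_text()]

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.anchors.append("".join(self.current) + "</a>")
            self.current = None
        elif tag == "li" and self.li_stack:
            self.li_stack.pop()


sample = '''<li class="li_class">
    <a href="#" title="myurl">URL Text</a>
    <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
    <a href="#" title="myurl3">URL Text3</a></li>'''

parser = AnchorsInListItems("li_class")
parser.feed(sample)
parser.close()
for anchor in parser.anchors:
    print(anchor)
# <a href="#" title="myurl">URL Text</a>
# <a href="#" title="myurl2">URL Text2</a>
```

The third link is skipped because its enclosing li has class "foo". Unlike a regex, this does not care about attribute order or quoting style inside the tags.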

Soup recipe:

Download the one file required.
Place dl'd file in site-packages dir or similar.
Enjoy your soup.

OK, let's say I only want to find the a tags that are inside of

score 3 · Answer 2 · answered Jan 21 '10 at 03:04

You really shouldn't use regexes to parse HTML, ever.

Try BeautifulSoup or lxml.

But you asked, so a quick and naive version might look like this:

import re

html = """
<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>
"""

m = re.search(r'(<a .*>)', html)  # greedy; '.' stops at newlines by default
if m:
    print(m.group(1))

I can think of a lot of ways this would break.
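
To make that concrete, here is a slightly tightened variant together with one input that fools it. This is only a sketch: the name ANCHOR_RE and the second test string are invented for the example, and the pattern is still not a safe way to handle arbitrary HTML.

```python
import re

# Non-greedy match from an opening <a ...> to the nearest </a>.
# DOTALL lets "." cross newlines; IGNORECASE also accepts <A HREF=...>.
ANCHOR_RE = re.compile(r"<a\s[^>]*>.*?</a>", re.DOTALL | re.IGNORECASE)

html = """
<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>
"""

matches = ANCHOR_RE.findall(html)
print(matches)
# ['<a href="#" title="myurl">URL Text</a>']

# One way it breaks: the regex has no idea what an HTML comment is,
# so a commented-out link is reported as if it were live markup.
tricky = '<!-- <a href="#">old link</a> -->'
print(ANCHOR_RE.findall(tricky))
# ['<a href="#">old link</a>']
```

Other inputs that would likely trip it up include a literal ">" inside a quoted attribute value and links nested in script blocks, which is why a real parser is the usual recommendation.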

answered Jan 21 '10 at 03:04

Corey Goldberg

Considering what he wants to get back, you probably want something more like `/()/`. And yes, it breaks on pretty much everything. – Anon. Jan 21 '10 at 03:06

ghostdog74 · Answer 3 · 2010-01-21T03:37:48.820

You can try this, since your requirement is simple. No need for BeautifulSoup or regex.

>>> s="""
... <ul class="something">
... <li id="li_id">
... <a href="#" title="myurl">URL Text</a>
... </li>
... </ul>
... """
>>> for item in s.split("</a>"):
...    if "<a href=" in item:
...        print(item[item.find("<a href="):] + "</a>")
...
<a href="#" title="myurl">URL Text</a>

You can include a check of '<li class="li_class">' in the if statement as desired.
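
One possible reading of that suggestion is sketched below. Checking for the li tag inside each "</a>"-delimited piece would miss the second link in the same list item, so this version first splits on "</li>" and only then on "</a>". The function name anchors_in_li and its li_open parameter are made up for the illustration, and the approach still depends on the markup being written exactly this way (href as the first attribute, the li tag spelled exactly as given).

```python
def anchors_in_li(html, li_open='<li class="li_class">'):
    """Return '<a href=...>...</a>' substrings found after li_open.

    Hypothetical helper: plain string matching, no real HTML parsing.
    """
    found = []
    for li_chunk in html.split("</li>"):
        start = li_chunk.find(li_open)
        if start == -1:
            continue  # this chunk does not contain the wanted <li>
        body = li_chunk[start + len(li_open):]
        for item in body.split("</a>"):
            pos = item.find("<a href=")
            if pos != -1:
                found.append(item[pos:] + "</a>")
    return found


sample = '''<li class="li_class">
    <a href="#" title="myurl">URL Text</a>
    <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
    <a href="#" title="myurl3">URL Text3</a></li>'''

for anchor in anchors_in_li(sample):
    print(anchor)
# <a href="#" title="myurl">URL Text</a>
# <a href="#" title="myurl2">URL Text2</a>
```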

edited Jan 21 '10 at 03:37

answered Jan 21 '10 at 03:06

And of course lots of perfectly correct ways to write that HTML (even just switching the title and href attributes, for example!) will make this go down in flames. What a perfectly terrible "solution"! – Alex Martelli Jan 21 '10 at 03:16
I think you all should not jump too far ahead. What OP wants to do is supposedly very simple. You guys make it too complicated! – ghostdog74 Jan 21 '10 at 03:35
