I need a regex in Python to find a link's HTML inside a larger chunk of HTML.

So if I have:

<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>

I would get back:

<a href="#" title="myurl">URL Text</a>

I'd like to do it with a regex and not BeautifulSoup or something similar. Does anyone have a snippet lying around that I could use for this?

Thanks

Joe

3 Answers

Soup is good for you:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<ul class="something">
... <li id="li_id">
... <a href="#" title="myurl">URL Text</a>
... </li>
... </ul>''')

There are many arguments you can pass to the findAll method; see the BeautifulSoup documentation for more. The one-liner below will get you started by returning a list of all links that match some conditions.

>>> soup.findAll(href='#', title='myurl')
[<a href="#" title="myurl">URL Text</a>]
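If a regex really is a hard requirement, as in the question, here is a minimal sketch using only the standard `re` module. The names `A_TAG` and `links` are made up for illustration. It assumes the anchors are not nested, that no attribute value contains a `>`, and that the markup is reasonably well formed; real-world HTML breaks those assumptions often, which is why a parser is usually the safer choice.

```python
import re

html = '''<ul class="something">
<li id="li_id">
<a href="#" title="myurl">URL Text</a>
</li>
</ul>'''

# Non-greedy match from an opening <a ...> to the nearest </a>.
# \b stops the pattern from matching tags such as <abbr> or <article>.
# DOTALL lets the link text span several lines.
A_TAG = re.compile(r'<a\b[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)

links = A_TAG.findall(html)
print(links)  # ['<a href="#" title="myurl">URL Text</a>']
```

`findall` returns the matched substrings as a list, so the result has the same shape as the question's expected output.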

Edit: based on OP's comment, added info included:

So let's say you're interested only in the a tags within list elements of a certain class, <li class="li_class">. You could do something like this:

>>> from pprint import pprint
>>> soup = BeautifulSoup('''<li class="li_class">
...     <a href="#" title="myurl">URL Text</a>
...     <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
...     <a href="#" title="myurl3">URL Text3</a></li>''')  # just some sample html

>>> for elem in soup.findAll("li", "li_class"):  # only <li> tags with class "li_class"
...   pprint(elem.findAll('a'))  # every <a> nested inside that <li>
... 
[<a href="#" title="myurl">URL Text</a>,
 <a href="#" title="myurl2">URL Text2</a>]
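If installing BeautifulSoup is not an option, the same filter can be sketched with the standard library's `html.parser` instead. This is a different technique from the soup code above, not a drop-in replacement. The class name `LinkInLiCollector` and its attributes are hypothetical, and the sketch assumes each link contains only plain text (nested tags inside the `<a>` are dropped, and character references in the link text come back unescaped).

```python
from html.parser import HTMLParser


class LinkInLiCollector(HTMLParser):
    """Collect <a>...</a> markup found inside <li> elements of one class."""

    def __init__(self, li_class):
        super().__init__()
        self.li_class = li_class
        self.li_stack = []   # one bool per open <li>: does it carry li_class?
        self.current = None  # pieces of the <a> element being collected
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            classes = (dict(attrs).get("class") or "").split()
            self.li_stack.append(self.li_class in classes)
        elif tag == "a" and any(self.li_stack):
            # get_starttag_text() returns the start tag exactly as written.
            self.current = [self.get_starttag_text()]

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.links.append("".join(self.current) + "</a>")
            self.current = None
        elif tag == "li" and self.li_stack:
            self.li_stack.pop()


html = '''<li class="li_class">
    <a href="#" title="myurl">URL Text</a>
    <a href="#" title="myurl2">URL Text2</a></li><li class="foo">
    <a href="#" title="myurl3">URL Text3</a></li>'''

parser = LinkInLiCollector("li_class")
parser.feed(html)
parser.close()
print(parser.links)
```

With the sample HTML this collects the first two links and skips the one inside `<li class="foo">`, matching the soup output above.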

Soup recipe:

  1. Download the one file required.
  2. Place dl'd file in site-packages dir or similar.
  3. Enjoy your soup.
mechanical_meat
  • Ok, let's say I only want to find the a tags that are inside of <li class="li_class">. So, if the li tag doesn't have that class, I don't want to return the a tag. How do I do that? – Joe Jan 21 '10 at 03:27