I have some HTML source from which I am trying to remove some tags. I know that using regular expressions to remove tags is not advised, but I figured this would be the easiest route to take.

What I need to do is remove all img and a tags, along with the contents of the a tags, but only when they are inside a p tag. I am unsure how to do this with a regular expression.

For example, when it comes across:

<p><img src="center.jpg"><a href="?center">center</a>TEXT<img src="right.jpg"><a href="?rightspan">right</a> MORE TEXT<img src="another.jpg"></p>

The output should be the following, where all a tags (with their content) and all img tags are removed:

<p>TEXT MORE TEXT</p>

The problem is that, as I stated, I'm not sure how to do this. My current regular expression removes all of the a and img tags in the source, not just the ones inside a p tag:

re.sub(r'<(img|a).*?>|</a>', '', text)
J. Stedam

1 Answer

Without some kind of assertion to limit its scope, your regular expression will indeed remove every a and img tag in the source. Although you possibly could do this with a regular expression, I advise against going this route: regular expressions cope poorly with nested or irregular HTML.
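If a regular expression really is required, one way to scope it is to match each p element first and clean only its body in a callback. The following is a minimal sketch under stated assumptions, not a robust solution: the names P_BLOCK, INNER and strip_in_paragraphs are hypothetical, and it assumes p elements are not nested and the markup is well formed.

```python
import re

# Hypothetical helper names; this is a sketch, not a general HTML parser.
# Matches one <p ...>...</p> block (assumes <p> elements are not nested).
P_BLOCK = re.compile(r'(<p\b[^>]*>)(.*?)(</p>)', re.DOTALL | re.IGNORECASE)

# Matches a whole <a>...</a> element (tags plus content) or a single <img> tag.
INNER = re.compile(r'<a\b[^>]*>.*?</a>|<img\b[^>]*>', re.DOTALL | re.IGNORECASE)


def strip_in_paragraphs(text):
    """Remove <a> elements and <img> tags, but only inside <p> blocks."""
    def clean(match):
        open_tag, body, close_tag = match.groups()
        # Only the body of the matched <p> block is rewritten.
        return open_tag + INNER.sub('', body) + close_tag

    return P_BLOCK.sub(clean, text)


if __name__ == "__main__":
    html = ('<p><img src="center.jpg"><a href="?center">center</a>TEXT'
            '<img src="right.jpg"><a href="?rightspan">right</a> MORE TEXT'
            '<img src="another.jpg"></p>')
    print(strip_in_paragraphs(html))  # <p>TEXT MORE TEXT</p>
```

Anything outside a p block is left alone because the inner pattern is only ever applied to the text captured between the opening and closing p tags. Nested or malformed markup will still defeat it, which is part of why a parser is the safer choice.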

You could simply use BeautifulSoup instead: pass findAll() a list of the tag names to remove, then replace each match whose parent is a p element.

>>> from BeautifulSoup import BeautifulSoup
>>> html = '<p><img src="center.jpg"><a href="?center">center</a>TEXT<img src="right.jpg"><a href="?rightspan">right</a> MORE TEXT<img src="another.jpg"></p>'
>>> soup = BeautifulSoup(html)
>>> for m in soup.findAll(['a', 'img']):
...     if m.parent.name == 'p':
...         m.replaceWith('')
...
>>> print soup

<p>TEXT MORE TEXT</p>

Note: This removes every <a> element (including its content) and every <img> element whose direct parent is a <p> element, leaving the rest untouched. If you have BS4, import BeautifulSoup from bs4 and use find_all() and replace_with() instead.
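For reference, the BS4 variant mentioned in the note might look like the sketch below on Python 3. It assumes the bs4 package is installed and uses the built-in "html.parser" backend. The function name strip_links_and_images_in_p is hypothetical, and it calls extract() rather than replace_with('') as a deliberate substitution. Both detach the element from the tree.

```python
from bs4 import BeautifulSoup


def strip_links_and_images_in_p(html):
    """Remove <a> and <img> elements whose direct parent is a <p> element."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect the matches first so the tree is not modified while searching it.
    targets = [
        tag for tag in soup.find_all(["a", "img"])
        if tag.parent is not None and tag.parent.name == "p"
    ]

    for tag in targets:
        # extract() detaches the element (and, for <a>, its content) from the tree.
        tag.extract()

    return str(soup)


if __name__ == "__main__":
    html = ('<p><img src="center.jpg"><a href="?center">center</a>TEXT'
            '<img src="right.jpg"><a href="?rightspan">right</a> MORE TEXT'
            '<img src="another.jpg"></p>')
    print(strip_links_and_images_in_p(html))  # <p>TEXT MORE TEXT</p>
```

As in the original snippet, only direct children of a p element are affected, so an a or img element nested deeper (or outside any p) stays in place.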

hwnd
  • Noticing your good regex and looking for things to learn in some of your old posts... And +1'ing this for an instructive solution that has nothing to do with regex! – zx81 May 05 '14 at 23:42