I have some source from which I am trying to remove some tags. I know that using regular expressions to remove tags is not advised, but I figured this would be the easiest route to take.
What I need to do is remove all <img> and <a> tags, along with the contents of the <a> tags, but only those inside a <p> tag. I am unsure how to do this with a regular expression.
For example, when it comes across:
<p><img src="center.jpg"><a href="?center">center</a>TEXT<img src="right.jpg"><a href="?rightspan">right</a> MORE TEXT<img src="another.jpg"></p>
The output should be the following, with all <a> tags (and their contents) and all <img> tags removed:
<p>TEXT MORE TEXT</p>
The problem is that, as I said, I'm not sure how to do this: my regular expression removes all of the <a> and <img> tags in the source, not just the ones inside a <p> tag.
re.sub(r'<(img|a).*?>|</a>', '', text)
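One way the restriction to <p> could be sketched is with two passes: an outer pattern finds each <p>...</p> block, and a callback runs the tag-stripping pattern only on that block. This is a minimal sketch, not a robust HTML solution. It assumes <p> elements are never nested and that attribute values contain no ">" characters. The names P_BLOCK, A_OR_IMG and strip_in_paragraphs are made up for illustration, and the leading keep.jpg image was added to the sample to show that tags outside <p> are left alone.

```python
import re

# Matches one whole <p>...</p> block (assumes <p> elements are not nested).
P_BLOCK = re.compile(r'<p\b[^>]*>.*?</p>', re.DOTALL | re.IGNORECASE)

# Matches an <a>...</a> element including its contents, or a lone <img> tag.
# The \b keeps the pattern from matching tags such as <abbr> or <article>.
A_OR_IMG = re.compile(r'<a\b[^>]*>.*?</a>|<img\b[^>]*>', re.DOTALL | re.IGNORECASE)


def strip_in_paragraphs(text):
    """Remove <a> elements (with contents) and <img> tags, but only inside <p> blocks."""
    # The callback receives each <p> block and cleans just that substring.
    return P_BLOCK.sub(lambda m: A_OR_IMG.sub('', m.group(0)), text)


sample = (
    '<img src="keep.jpg">'
    '<p><img src="center.jpg"><a href="?center">center</a>TEXT'
    '<img src="right.jpg"><a href="?rightspan">right</a> MORE TEXT'
    '<img src="another.jpg"></p>'
)

print(strip_in_paragraphs(sample))
# prints: <img src="keep.jpg"><p>TEXT MORE TEXT</p>
```

Note that the original pattern, `<(img|a).*?>|</a>`, strips only the tags themselves and leaves the link text ("center", "right") behind, so the inner pattern here matches the whole <a>...</a> element instead.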