I am trying to use BeautifulSoup to first remove the <a> tags in an HTML string but keep their content. After that, I would like to remove all remaining tags and replace them with newlines.
The strip_tags function is from this post.
Here is an example of what I am trying to do:
text = "<p>This is a <a>test</a></p>"
soup = strip_tags(text, ["a"])
plain_text = soup.get_text("\n")
print(plain_text)
For some reason the output is u'This is a \ntest'. If the <a> tag has already been stripped out, why does it behave as if the tag were still there?
The expected output is: This is a test
A more complex example:
<p>First</p><a>Link</a><p>Second</p>
How can I put the contents of the <p> tags on separate lines and still strip the <a> tag out?
Indeed, if you print soup.encode_contents(), no <a> tag is there.
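For anyone who wants to reproduce this, here is a minimal sketch. It assumes Python 3 with the bs4 package, and it uses a hypothetical helper, unwrap_tags, built on Tag.unwrap() as a stand-in for strip_tags (the real strip_tags from the linked post may work differently). It suggests the likely cause: once the <a> tag is gone, its text stays behind as a separate text node next to "This is a ", and get_text("\n") puts the separator between every pair of text nodes. Serializing and re-parsing the soup is one possible way to merge the adjacent nodes, not necessarily the best one.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, names):
    """Hypothetical stand-in for strip_tags: drop the named tags, keep their children."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(names):
        tag.unwrap()  # replace the tag with its own contents
    return soup


soup = unwrap_tags("<p>This is a <a>test</a></p>", ["a"])

# The <a> tag is gone, but <p> now holds two adjacent text nodes, not one.
node_count = len(soup.p.contents)

# get_text() inserts the separator between every pair of text nodes.
split_text = soup.get_text("\n")

# Re-parsing the serialized markup merges the adjacent text nodes.
merged_text = BeautifulSoup(str(soup), "html.parser").get_text("\n")

# The more complex example: each <p> and the bare link text end up on their own line.
complex_soup = unwrap_tags("<p>First</p><a>Link</a><p>Second</p>", ["a"])
complex_text = BeautifulSoup(str(complex_soup), "html.parser").get_text("\n")

print(repr(split_text))
print(repr(merged_text))
print(repr(complex_text))
```

In this sketch the first print shows the unwanted line break and the second shows the text on a single line. Whether re-parsing is acceptable depends on the size of the document; it is only meant to show where the extra separator comes from.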