
I am trying to use BeautifulSoup to first remove the <a> tags from an HTML string while keeping their contents. After that, I would like to remove all remaining tags and replace them with new lines.

The strip_tags function is from This post.

Here is an example of what I am trying to do:

text = "<p>This is a <a>test</a></p>"
soup = strip_tags(text, ["a"])
plain_text = soup.get_text("\n")
print(plain_text)

For some reason the output is u'This is a \ntest'. If the <a> tag is stripped out already why does it think it is still there?

The expected output is This is a test.

A more complex example: <p>First</p><a>Link</a><p>Second</p>

How can I put a new line between the <p> tags and still be able to strip the <a> tag out?

Indeed if you print soup.encode_contents(), no <a> is there.

rabz100
  • `u'This is a test'`. If there is no tag there should be no new line. – rabz100 Jul 08 '16 at 18:09
  • The breaks aren't there because it sees an 'a' tag. They are there because the tree contains several separate NavigableString elements, and the get_text function inserts the \n between every string it finds in the tree –  Jul 08 '16 at 18:14
  • The reason it is being added there is that `replace_with` (`replaceWith`), when passed a plain string, inserts a new NavigableString – keety Jul 08 '16 at 18:24
  • According to the crummy.com documentation: "A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree." I think what's happening is that you're removing the tags from the content but not fixing the tree's underlying elements. So when get_text() walks the tree it still sees extra "leaves" and puts in the "\n"s –  Jul 08 '16 at 18:50
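
The comments above can be checked with a small sketch. It assumes the bs4 package with the built-in "html.parser" backend, and it uses Tag.unwrap() as a stand-in for strip_tags (which is not shown in this thread), so the exact tree strip_tags produces may differ:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>This is a <a>test</a></p>", "html.parser")

# Assumption: unwrap() stands in for strip_tags(text, ["a"]).
# It removes the <a> tag but leaves its text in the same position.
soup.a.unwrap()

markup = str(soup)            # serialized markup: no <a> left
pieces = list(soup.strings)   # the tree still holds two separate strings
joined = soup.get_text("\n")  # separator goes between every string

print(markup)
print(pieces)
print(repr(joined))
```

The serialized markup looks clean, but the text that used to sit inside <a> is still its own node next to "This is a ", which is why get_text("\n") splits it onto a new line.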

1 Answer


The strip_tags function is from This post.

That function replaces tags with text they contain, recursively.

Thus, your '<a>test</a>' is replaced with 'test'. No '<a>' tags there.
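
Removing the tag does not merge the text it leaves behind with the neighbouring text, so get_text("\n") still sees two strings. One possible workaround, sketched here under assumptions, is to re-parse the cleaned markup before extracting text. The helper name text_without_tags is hypothetical, unwrap() is used in place of strip_tags, and the "html.parser" backend is assumed:

```python
from bs4 import BeautifulSoup


def text_without_tags(html, inline_tags=("a",), separator="\n"):
    """Hypothetical helper: drop inline_tags but keep their text,
    then join the remaining pieces of text with separator."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(list(inline_tags)):
        tag.unwrap()  # replace the tag with its own children
    # Re-parsing the serialized markup merges adjacent strings,
    # so no separator is inserted where the tag used to be.
    merged = BeautifulSoup(str(soup), "html.parser")
    return merged.get_text(separator)


simple = text_without_tags("<p>This is a <a>test</a></p>")
complex_case = text_without_tags("<p>First</p><a>Link</a><p>Second</p>")
print(repr(simple))
print(repr(complex_case))
```

In the second input the link text is not adjacent to any other string, so it still ends up on its own line between the two paragraphs. Beautiful Soup 4.8.0 and later also offer smooth(), which merges adjacent strings in place and could replace the re-parse step.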

Daerdemandt