Beautiful Soup Parsing Error

Question

I am trying to use BeautifulSoup to first remove the <a> tags in an HTML string but keep their content. After that I would like to remove all remaining tags and replace them with new lines.

The strip_tags function is from This post.

Here is an example of what I am trying to do:

text = "<p>This is a <a>test</a></p>"
soup = strip_tags(text, ["a"])
plain_text = soup.get_text("\n")
print(plain_text)

For some reason the output is u'This is a \ntest'. If the <a> tag is stripped out already why does it think it is still there?

The expected output is This is a test.

A more complex example: First<a>Link</a>Second

How can I separate between  tags, and still be able to strip the <a> tag out?

Indeed if you print soup.encode_contents(), no <a> is there.

`u'This is a test'`. If there is no tag there should be no new line. — rabz100, Jul 08 '16 at 18:09
The breaks aren't there because it sees an <a> tag. After the tag is replaced, the tree holds several separate NavigableString nodes next to each other, and get_text puts the \n between every string it finds in the tree — , Jul 08 '16 at 18:14
It is added there because `replaceWith`, when passed a plain string, inserts it as a new NavigableString next to the existing one — keety, Jul 08 '16 at 18:24
According to the crummy.com documentation: "A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree." I think what's happening is that you're removing the tags from the content but not merging the tree's underlying text nodes. So when get_text() walks the tree it still sees extra leaves and puts a "\n" between them — , Jul 08 '16 at 18:50

score -1 · Answer 1 · edited May 23 '17 at 11:58

The strip_tags function is from This post.

That function replaces tags with text they contain, recursively.

Thus, your '<a>test</a>' is replaced with 'test'. No '<a>' tags there.
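
As the comments on the question point out, the leftover line break comes from adjacent text nodes, not from a surviving tag. One way around it, as a rough sketch rather than the definitive fix: unwrap the inline tags, then re-parse the serialized markup so that neighbouring strings are merged before calling get_text. The names text_without_tags and inline_tags are made up for this example, and so is the second sample string. It assumes bs4 with the built-in "html.parser".

```python
from bs4 import BeautifulSoup


def text_without_tags(html, inline_tags=("a",), separator="\n"):
    """Unwrap the given inline tags, then join the remaining text nodes.

    Hypothetical helper for illustration; not the strip_tags function
    from the linked post.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(list(inline_tags)):
        tag.unwrap()  # drop the tag itself, keep its children in place

    # After unwrapping, "This is a " and "test" are still two separate
    # NavigableString nodes. Serializing and parsing again merges them.
    # (Newer bs4 releases also offer soup.smooth() for this; check that
    # your installed version has it before relying on it.)
    merged = BeautifulSoup(str(soup), "html.parser")
    return merged.get_text(separator)


print(text_without_tags("<p>This is a <a>test</a></p>"))
print(text_without_tags("<p>First<a>Link</a>Second</p><p>Third</p>"))
```

With the first string this should print This is a test on a single line. In the second, made-up string, the only line break falls between the two <p> elements.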

answered Jul 08 '16 at 17:42

Daerdemandt
