Call text but totally exclude tables

Question

I am using Beautiful Soup to load an XML document. All I need is the text, ignoring the tags, and the text attribute works nicely.

However, I would like to totally exclude anything within <table></table> tags. I had the idea of replacing everything in between using a regex, but I am wondering whether there is a cleaner solution, partly because of the well-known advice "Don't parse [X]HTML with regex!". For instance:

s =""" <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>. 
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
from bs4 import BeautifulSoup as Soup  # assumed alias; the question does not show its import
clean = Soup(s)
print(clean.text)

will give

Hasselt ( ) is a Belgian city and municipality. 
Passenger growth
YearPassengersPercentage 
1996360 000100%
19971 498 088428%

whereas I only want:

Hasselt ( ) is a Belgian city and municipality.

score 1 · Accepted Answer · answered Sep 22 '16 at 16:57

You can locate the content element and remove all table elements from it, then get the text:

from bs4 import BeautifulSoup

s =""" <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>.
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
soup = BeautifulSoup(s, "xml")

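content = soup.content
# Calling a tag is shorthand for find_all(), so this visits every <table> inside <content>
for table in content("table"):
    table.extract()  # detach the table (and everything inside it) from the tree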

print(content.get_text().strip())

Prints:

Hasselt ( ) is a Belgian city and municipality.
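
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name text_without and its parameters are made up for illustration, the sample is a shortened, well-formed version of the one in the question, and it uses the built-in "html.parser" so that lxml is not required (pass "xml" instead to match the code above).

```python
from bs4 import BeautifulSoup


def text_without(markup, tag_names, parser="html.parser"):
    """Return the text of `markup`, skipping every subtree whose tag is in `tag_names`.

    Hypothetical helper, not part of bs4 itself.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all() accepts a list of tag names and matches any of them
    for element in soup.find_all(list(tag_names)):
        element.extract()  # detach the element and its children from the tree
    return soup.get_text().strip()


# Shortened, well-formed variant of the sample from the question
sample = (
    "<content><p>Hasselt is a Belgian city and municipality.\n"
    "<table><cell>Year</cell><cell>Passengers</cell></table>"
    "</p></content>"
)

result = text_without(sample, ["table"])
print(result)
```

Because the helper parses the string itself, the tree it modifies is a private one, so any soup object you already hold stays intact.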

You must have started writing code before the ink was dry in the question ;) — Padraic Cunningham, Sep 22 '16 at 16:58
@PadraicCunningham :) actually, have `bs4` handy code snippets prepared. We are doing serious sports here! Thanks. — alecxe, Sep 22 '16 at 16:59
