I am using Beautiful Soup to load an XML document. All I need is the text, ignoring the tags, and the text
attribute works nicely.
However, I would like to completely exclude anything within <table></table>
tags. I had the idea of stripping everything in between with a regex, but I am wondering whether there is a cleaner solution, partly because of the well-known advice "Don't parse [X]HTML with regex!". For instance:
s = """ <content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> <link target="city">city</link> and <link target="Municipalities in Belgium">municipality</link>.
<table><cell>Passenger growth
<cell>Year</cell><cell>Passengers</cell><cell>Percentage </cell></cell>
<cell>1996</cell><cell>360 000</cell><cell>100%</cell>
<cell>1997</cell><cell>1 498 088</cell><cell>428%</cell>
</table>"""
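from bs4 import BeautifulSoup as Soup  # assumed import; Soup is an alias for the BeautifulSoup class
clean = Soup(s)
print(clean.text)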
This will give:
Hasselt ( ) is a Belgian city and municipality.
Passenger growth
YearPassengersPercentage
1996360 000100%
19971 498 088428%
whereas I only want:
Hasselt ( ) is a Belgian city and municipality.
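
For illustration, here is a minimal sketch of the kind of tag-based (non-regex) approach I am hoping for. It assumes the bs4 package and Python's built-in "html.parser", and is not necessarily the recommended way. The helper name text_without_tables and the shortened sample are hypothetical. It finds every table element, removes it from the tree with decompose(), and only then reads the text:

```python
from bs4 import BeautifulSoup


def text_without_tables(markup):
    """Return the text of `markup`, skipping everything inside <table> tags."""
    # "html.parser" is an assumption; another installed parser could be used instead.
    soup = BeautifulSoup(markup, "html.parser")
    # Remove each <table> element, together with its children, from the tree.
    for table in soup.find_all("table"):
        table.decompose()
    # Whatever text is left lies outside any table.
    return soup.get_text().strip()


# Shortened, hypothetical sample in the same shape as the data above.
sample = """<content><p>Hasselt ( ) is a <link target="Belgium">Belgian</link> city.
<table><cell>1996</cell><cell>360 000</cell></table></p></content>"""

print(text_without_tables(sample))
```

Note that decompose() modifies the parsed tree in place, so if the table contents are needed later the markup would have to be parsed again.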