Extracting text from a mangled HTML tag with <br> separating the elements

Question

So I have this html piece:

<p class="tbtx">


                              MWF



<br></br>

TH
</p>

It seems completely mangled. I need to extract the data, i.e. ['MWF', 'TH'].

The only solution I could think of is to remove all newlines and spaces from the HTML, split it at <br>, rebuild the HTML structure and then extract .text, but that seems a bit ridiculous.

Any proper solutions for this?

score 3 · Accepted Answer · edited Jul 24 '14 at 15:09

.stripped_strings is what you are looking for: it yields every string in the document with leading and trailing whitespace stripped, and skips strings that consist only of whitespace.

Demo:

from bs4 import BeautifulSoup

data = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning
print(list(soup.stripped_strings))  # prints ['MWF', 'TH'] (Python 3)
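
When the page holds more than this one paragraph, the same attribute can be read from a single tag instead of the whole soup. A minimal sketch, assuming bs4 is installed; the wrapping <div> and the second <p> are made up for illustration and are not part of the question:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <div> wrapper and the second <p> are invented here.
html = """<div>
<p class="tbtx">

        MWF

<br></br>

TH
</p>
<p class="other">  ignore me  </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Look up only the paragraph of interest, then read its stripped strings.
cell = soup.find("p", class_="tbtx")
days = list(cell.stripped_strings)
print(days)  # ['MWF', 'TH']
```

Scoping the lookup this way keeps text from unrelated tags out of the result.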

edited Jul 24 '14 at 15:09

alecxe


answered Jul 24 '14 at 15:08

TheDarkTurtle


Note that this doesn't work with all versions of BeautifulSoup. It worked for me only after I installed BeautifulSoup4 – Itay Dec 06 '14 at 13:46
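
If installing BeautifulSoup4 is not an option, roughly the same result can be had from the standard library alone. This is a sketch for Python 3; TextCollector is a name made up here, not part of any library, and it only approximates what .stripped_strings does:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects non-blank text nodes, stripped."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.strings.append(text)


html = """<p class="tbtx">

                              MWF

<br></br>

TH
</p>"""

collector = TextCollector()
collector.feed(html)
collector.close()
print(collector.strings)  # ['MWF', 'TH']
```

Unlike BeautifulSoup, this does not build a tree, so it cannot restrict the extraction to a particular tag without extra bookkeeping.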

score 1 · Answer 2 · answered Jul 24 '14 at 15:06

You can do this using filter and BeautifulSoup to pull out just the text from your HTML snippet.

from bs4 import BeautifulSoup

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

# filter(None, ...) drops the empty strings left between the two values
print(list(filter(None, BeautifulSoup(html, "html.parser").get_text().strip().split("\n"))))

Outputs:

['MWF', 'TH']
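
A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: bs4's get_text() accepts a separator and a strip flag, so the stripping and the empty-string filtering can be left to the library. It assumes the extracted strings themselves contain no newlines:

```python
from bs4 import BeautifulSoup

html = """<p class="tbtx">

                              MWF

<br></br>

TH
</p>"""

soup = BeautifulSoup(html, "html.parser")

# strip=True strips each string and drops the blank ones;
# the remaining strings are joined with the separator.
text = soup.get_text("\n", strip=True)
parts = text.split("\n")
print(parts)  # ['MWF', 'TH']
```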

score -3 · Answer 3 · edited Jul 24 '14 at 15:35

I would recommend extracting the text using regular expressions.

For instance, if your HTML was as you noted:

"
<p class="tbtx">


                              MWF



<br></br>

TH
</p>
"

We can see that the desired text ("MWF", "TH") is surrounded by whitespace characters.

Thus the regular expression "\s\w+\s" reads "find any run of word characters that is surrounded by whitespace characters" and would identify the desired text. Note that each match includes the surrounding whitespace, so it still has to be stripped.

Here is a cheat sheet for creating Regular Expressions: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

And you can test your Regular Expression on desired text here: http://regexpal.com/
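
For reference, the pattern above can be tried directly in Python with the standard re module. This is only a sketch: the capturing group is added here so that the surrounding whitespace is left out of the result, and the approach is fragile, since any attribute value containing spaces (for example class="a tbtx b") would also match:

```python
import re

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

# \s(\w+)\s : a run of word characters with whitespace on both sides;
# the parentheses capture only the word itself.
matches = re.findall(r"\s(\w+)\s", html)
print(matches)  # ['MWF', 'TH']
```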
