1

So I have this HTML snippet:

<p class="tbtx">


                              MWF



<br></br>

TH
</p>

which seems to be completely mangled. I need to extract the data, i.e. ['MWF', 'TH'].

The only solution I could think of is to remove all newlines and spaces from the HTML, split it at the <br> tag,
rebuild the HTML structure and then extract .text, but that seems a bit ridiculous.

Any proper solutions for this?

alecxe
Granitosaurus

3 Answers

3

.stripped_strings is what you are looking for: it yields each string in the document with leading and trailing whitespace removed, and skips strings that consist only of whitespace.

Demo:

from bs4 import BeautifulSoup

data = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

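# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
print(list(soup.stripped_strings))  # prints ['MWF', 'TH'] (Python 2 shows [u'MWF', u'TH'])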
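If the real page contains more than this one paragraph, soup.stripped_strings would return the text of the whole document. A minimal sketch of restricting the extraction to the tbtx paragraph follows; the second paragraph with class "other" is made up purely for illustration and is not part of the question.

```python
from bs4 import BeautifulSoup

# The question's snippet plus one hypothetical extra paragraph ("other").
html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>
<p class="other">ignored</p>"""

soup = BeautifulSoup(html, "html.parser")

# Look up only the paragraph of interest, then collect its stripped strings.
paragraph = soup.find("p", class_="tbtx")
days = list(paragraph.stripped_strings)
print(days)  # ['MWF', 'TH']
```

Calling stripped_strings on the tag rather than on the soup keeps text from unrelated elements out of the result.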
alecxe
TheDarkTurtle
  • Note that this doesn't work with all versions of BeautifulSoup. It worked for me only after I installed BeautifulSoup4 – Itay Dec 06 '14 at 13:46
1

You can do this using filter and BeautifulSoup to pull out just the text from your HTML snippet.

from bs4 import BeautifulSoup

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

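# filter(None, ...) drops the empty strings left between the newlines;
# list() is needed on Python 3, where filter returns an iterator.
print(list(filter(None, BeautifulSoup(html, "html.parser").get_text().strip().split("\n"))))

Outputs:

['MWF', 'TH']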
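A variation on the same idea, offered as a sketch and not as part of the original answer: get_text() accepts a separator and a strip flag, so Beautiful Soup can strip each piece of text and join the pieces itself, which avoids filtering out empty lines afterwards. The variable names here are arbitrary.

```python
from bs4 import BeautifulSoup

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

soup = BeautifulSoup(html, "html.parser")

# strip=True strips every text node and drops the ones that end up empty;
# the remaining pieces are joined with the separator.
text = soup.get_text(separator="\n", strip=True)
days = text.split("\n")
print(days)  # ['MWF', 'TH']
```

This assumes the extracted values never contain a newline themselves; if they might, pick a separator that cannot occur in the data.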
Andy
-3

I would recommend extracting the text using regular expressions.

For instance, if your HTML is as you posted:

"
<p class="tbtx">


                              MWF



<br></br>

TH
</p>
"

We can see that the desired text ("MWF", "TH") is surrounded by whitespace characters.

Thus the regular expression ("\s\w+\s") reads "find any run of word characters that is surrounded by whitespace characters" and would identify the desired text.
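As a rough sketch of how that pattern could be applied in Python with the standard re module: the capture group around \w+ is an addition of this example, so that the surrounding whitespace is left out of the results. It works on this particular snippet, but it is fragile on general HTML, where words inside tags or attributes can also be surrounded by whitespace.

```python
import re

html = """<p class="tbtx">


                              MWF



<br></br>

TH
</p>"""

# \s\w+\s from the answer, with a group added so only the word is returned.
days = re.findall(r"\s(\w+)\s", html)
print(days)  # ['MWF', 'TH']
```

Note that "class" in the opening tag is not matched only because it is followed by "=" and not by whitespace; a differently formatted tag could produce false matches.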

Here is a cheat sheet for creating Regular Expressions: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

And you can test your Regular Expression on desired text here: http://regexpal.com/

Andy
Jordatech