
I am trying to strip the HTML surrounding the data I want from a webpage, so that only the raw data is left and I can insert it into a database. So if I have something like:

<p class="location"> Atlanta, GA </p>

The following code would return

Atlanta, GA </p>

This is not what I expect: the closing tag should be removed as well. This is a more specific version of the basic problem I found here. Any help would be appreciated, thanks! The code is below.

def delHTML(self, html):
    """
    html is a list made up of items with data surrounded by html
    this function should get rid of the html and return the data as a list
    """

    for n,i in enumerate(html):
        if i==re.match('<p class="location">',str(html[n])):
            html[n]=re.sub('<p class="location">', '', str(html[n]))

    return html
mnky9800n
  • I believe this is appropriate: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – tzaman Sep 12 '12 at 22:52
  • You should really adapt [this Java code](http://thedailywtf.com/Articles/How-to-Extract-Text-from-HTML-%28Experts-Only%29.aspx)... (The Daily WTF has never been so timely!) – Matteo Italia Sep 12 '12 at 22:53
  • Seriously: what you want probably is a SAX HTML parser. Python includes [`HTMLParser`](http://docs.python.org/library/htmlparser.html), it seems like a good solution for your problem. – Matteo Italia Sep 12 '12 at 22:56
  • BeautifulSoup is a good way to parse HTML in Python. –  Sep 12 '12 at 23:00

2 Answers


As rightly pointed out in the comments, you should be using a dedicated library to parse HTML and extract text. Here are some examples:

Thomas Orozco

Assuming all you want is to extract the data contained in <p class="location"> tags, you could use a quick & dirty (but correct) approach with Python's HTMLParser class (a simple SAX-style HTML parser), like this:

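from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # per-instance state, so that separate parsers never share one output list
        self.PLocationID=0  # <p> depth of the location paragraph being captured (0 = none)
        self.PCount=0       # current nesting depth of <p> tags
        self.buf=""         # text collected so far for the current location paragraph
        self.out=[]         # one extracted string per location paragraph

    def handle_starttag(self, tag, attrs):
        if tag=="p":
            self.PCount+=1
            # start capturing only if we are not already inside a location paragraph
            if ("class", "location") in attrs and self.PLocationID==0:
                self.PLocationID=self.PCount

    def handle_endtag(self, tag):
        if tag=="p":
            # this </p> closes the paragraph that started the capture
            if self.PLocationID==self.PCount:
                self.out.append(self.buf)
                self.buf=""
                self.PLocationID=0
            self.PCount-=1

    def handle_data(self, data):
        # keep text only while inside a location paragraph
        if self.PLocationID:
            self.buf+=data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print(parser.out)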

Output:

['This will', 'This too', "nested Ps shouldn't be allowed this will work"]

This will extract all the text contained inside any <p class="location"> tag, stripping all the tags inside it. Each separate tag (if not nested, which shouldn't be allowed for paragraphs anyhow) gets its own entry in the out list.
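
If the input is already a list of small HTML fragments, as in the question's delHTML, the same parser class can be applied to each item in turn. The following is only a sketch under that assumption: the names TextOnly and del_html are made up for illustration, it uses the Python 3 module name html.parser, and it drops every tag rather than looking only at <p class="location">.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes of whatever is fed to it."""

    def __init__(self):
        super().__init__()
        self.parts = []  # text chunks in document order

    def handle_data(self, data):
        self.parts.append(data)


def del_html(items):
    """Return the text of each HTML fragment in items, with all tags removed."""
    result = []
    for item in items:
        parser = TextOnly()      # fresh parser per item, so no state leaks between items
        parser.feed(str(item))
        parser.close()           # flush any text still buffered by the parser
        result.append("".join(parser.parts).strip())
    return result


print(del_html(['<p class="location"> Atlanta, GA </p>']))  # ['Atlanta, GA']
```

Unlike the version in the question, this returns a new list instead of modifying the one passed in, which tends to be easier to reason about.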

Notice that for more complex requirements this can easily get out of hand; in those cases a DOM parser is way more appropriate.
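
To give an idea of what the tree-based approach looks like, here is a minimal sketch using xml.etree.ElementTree from the standard library. It rests on a strong assumption: the markup must be well-formed XML, which real-world HTML often is not, so in practice a lenient parser such as BeautifulSoup (mentioned in the comments) would be the more usual choice. The function name location_texts and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def location_texts(markup):
    """Return the text of every <p class="location"> element in well-formed markup."""
    root = ET.fromstring(markup)  # raises ParseError if the markup is not well-formed XML
    return [
        "".join(p.itertext()).strip()    # all text inside the element, child tags dropped
        for p in root.iter("p")          # every <p> at any depth, in document order
        if p.get("class") == "location"  # exact match only; "location other" would not match
    ]


sample = """<html><body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div><p class="location"> Atlanta, GA </p></div>
</body></html>"""

print(location_texts(sample))  # ['This will', 'Atlanta, GA']
```

With a tree, the selection logic is a filter over elements, so there are no depth counters to maintain by hand.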

Matteo Italia