How to create structured array out of unstructured HTML using Python

Question

I have an HTML file that contains many div tags and table tags. The div tags contain IDs that relate them to other div sections, and each div section is followed by a table section that contains the data I need. I want to turn this HTML file into arrays, lists, dicts, or some other structure so that I can easily search for related info and extract what I need from it.

Example of what's in the HTML file:

<DIV class="info">
    <A name="bc968f9fa2db71455f50e0c13ce50e871fS7f0e" id="bc968f9fa2db71455f50e0c13ce50e871fS7f0e">
    <B>WORKSPACE_WEBAPP</B>&nbsp;(WORKSPACE_WEBAPP)<BR/>
    <B>Object ID: </B>&nbsp;&nbsp;bc968f9fa2db71455f50e0c13ce50e871fS7f0e<BR/>
    <B>Last Modified Date : </B>&nbsp;&nbsp;26-Sep-13 10:41:13<BR/>
    <B>Properties:</B><BR/>
</DIV>

<TABLE class="properties">
    <TR class="header"><TH>Property Name</TH><TH>Property Value</TH></TR>
    <TR><TD>serverName</TD><TD>FoundationServices0</TD></TR>
    <TR><TD>context</TD><TD>workspace</TD></TR>
    <TR><TD>isCompact</TD><TD>false</TD></TR>
    <TR><TD>AppServer</TD><TD>WebLogic 10</TD></TR>
    <TR><TD>port</TD><TD>28080</TD></TR>
    <TR><TD>maintVersion</TD><TD>11.1.2.2.0.66</TD></TR>
    <TR><TD>version</TD><TD>11.1.2.0</TD></TR>
    <TR><TD>SSL_Port</TD><TD>28443</TD></TR>
    <TR><TD>instance_home</TD><TD>/essdev1/app/oracle/Middleware/user_projects/epmsystem1</TD></TR>
    <TR><TD>configureBPMUIStaticContent</TD><TD>true</TD></TR>
    <TR><TD>validationContext</TD><TD>workspace/status</TD></TR>
</TABLE>

So I want to create an array for these div sections that also contains the properties from the associated table. I just can't wrap my head around the best way to do it. I know the answer will probably involve using BeautifulSoup to parse the tags. Since there is no other way to relate the table section to the div section, I believe I'll have to load the file a line at a time and process it that way, unless there is an easier method. Any ideas would be very helpful.

Have you taken a look at [Parsing HTML in Python](http://stackoverflow.com/questions/11709079/parsing-html-python)? — Huey, Apr 23 '15 at 14:06
Yes, I've read that and many other Python HTML parsing guides. I guess my biggest issue is how to control reading a div tag section, then reading its associated table section, then moving on to the next div tag section and table section until the entire file is parsed. — todd1215, Apr 23 '15 at 14:20
You could remove the div tag once it's read, then look for the next one until no more are found? — Huey, Apr 23 '15 at 14:22

score 2 · Answer 1 · answered Apr 23 '15 at 14:22

Use BeautifulSoup

A basic solution uses join, prettify and split. The idea is to convert the parsed document back to text and separate out the portion of interest:

text = '''
<div class="clearfix">
    <!--# of ppl associated with place-->
        This is some kind of buzzword:<br />
    <br />
    <!--Persontype-->
        <strong>Hey</strong> All            <br />
Something  text here            <br />
About Something
        <br />
Mobile Version        <br />
        <br />
        <strong>MObile</strong> Nokia            <br />
Try to implement here            <br />
Simple
            <br />
hey Thanks       <br />
</div>
'''

from bs4 import BeautifulSoup

# join is a no-op for a single string; it matters when text is a list of lines.
soup = BeautifulSoup(''.join(text), "html.parser")
# Keep everything after the <!--Persontype--> comment, then split at each <strong> tag.
for i in soup.prettify().split('<!--Persontype-->')[1].split('<strong>'):
    print('<strong>' + i)



O/P is :

Robᵩ · Accepted Answer · 2015-04-27T16:05:22.410

First, I need to restate your question. The example shows a div tag that contains an A tag. The A tag has an ID which you want to use as a key for looking up the following table. The div tag is followed by a table. Each row of the table contains a name-value pair associated with the object identified in the previous A.

You have a page filled with multiple div tags, each of which is described by my previous paragraph.

You want to produce some data structure to conveniently access the table data and associate it with the named object?

Do I have that right?

The answer, as you prophesied, is to use BeautifulSoup. We will create a dictionary, keyed by the id attribute. Each value in the dictionary is itself a dictionary, keyed by the "Property Name" in the table.

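from bs4 import BeautifulSoup
from pprint import pprint

# `page` holds the HTML document as a string.
result = {}
soup = BeautifulSoup(page, "html.parser")
divs = soup.find_all("div", {"class": "info"})
for div in divs:
    # The id of the A tag inside the div becomes the outer key.
    name = div.find("a")["id"]
    # The properties table is the next matching table after the div.
    table = div.find_next("table", {"class": "properties"})
    # {"class": None} skips the header row, which has class="header".
    rows = table.find_all("tr", {"class": None})
    rowd = {}
    for row in rows:
        cells = row.find_all("td")
        rowd[cells[0].text] = cells[1].text
    result[name] = rowd
pprint(result)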

Or, if you prefer dict comprehensions (as I do):

result = {
    div.find("a")["id"]: {
        cells[0].text : cells[1].text
        for row in table.find_all("tr", {"class":None})
        for cells in [row.find_all("td")]
    }
    for div in soup.find_all("div", {"class":"info"})
    for table in [div.find_next("table", {"class":"properties"})]
}

pprint(result)

When pointed at your example data, this yields:

{'bc968f9fa2db71455f50e0c13ce50e871fS7f0e': {u'AppServer': u'WebLogic 10',
                                             u'SSL_Port': u'28443',
                                             u'configureBPMUIStaticContent': u'true',
                                             u'context': u'workspace',
                                             u'instance_home': u'/essdev1/app/oracle/Middleware/user_projects/epmsystem1',
                                             u'isCompact': u'false',
                                             u'maintVersion': u'11.1.2.2.0.66',
                                             u'port': u'28080',
                                             u'serverName': u'FoundationServices0',
                                             u'validationContext': u'workspace/status',
                                             u'version': u'11.1.2.0'}}

To use the data structure, simply follow the dictionaries. For example:

print(result["bc968f9fa2db71455f50e0c13ce50e871fS7f0e"]["serverName"])
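
Since the goal was to search for related info, a lookup across all objects can be sketched on top of this structure. This is only a sketch: `find_objects` is a hypothetical helper name, and the sample `result` below is a shortened stand-in for the parsed dictionary.

```python
def find_objects(result, prop, value):
    """Return the IDs of all objects whose property `prop` equals `value`."""
    return [obj_id for obj_id, props in result.items()
            if props.get(prop) == value]


# Shortened stand-in for the dictionary of dictionaries built by the parser.
result = {
    "bc968f9fa2db71455f50e0c13ce50e871fS7f0e": {
        "serverName": "FoundationServices0",
        "context": "workspace",
        "port": "28080",
    },
}

# All values are strings, so compare against "28080", not the integer 28080.
matches = find_objects(result, "context", "workspace")
print(matches)
```

Using `props.get(prop)` rather than `props[prop]` means objects that lack the property are simply skipped instead of raising a KeyError.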

Very elegant. This is exactly what I am trying to do. Your restatement of my explanation is spot-on. — todd1215, Apr 27 '15 at 15:59
This line doesn't seem correct. I get an error with it: name = div.find("a")["id"] — todd1215, Apr 27 '15 at 17:13
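
One possible cause of the error in the last comment (an assumption, since the traceback was not posted) is an info div that has no A tag, or an A tag without an id. In the first case div.find("a") returns None and indexing it raises a TypeError; in the second, the ["id"] lookup raises a KeyError. A defensive sketch that skips such sections follows; `parse_objects` and the sample `page` are hypothetical and not taken from the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first div has no anchor, the second is well formed.
page = """
<DIV class="info"><B>No anchor here</B></DIV>
<TABLE class="properties">
  <TR class="header"><TH>Property Name</TH><TH>Property Value</TH></TR>
  <TR><TD>port</TD><TD>1</TD></TR>
</TABLE>
<DIV class="info"><A name="abc123" id="abc123"><B>WORKSPACE_WEBAPP</B></A></DIV>
<TABLE class="properties">
  <TR class="header"><TH>Property Name</TH><TH>Property Value</TH></TR>
  <TR><TD>serverName</TD><TD>FoundationServices0</TD></TR>
  <TR><TD>port</TD><TD>28080</TD></TR>
</TABLE>
"""


def parse_objects(html):
    """Map each anchor id to a dict of its table's name/value pairs."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for div in soup.find_all("div", {"class": "info"}):
        anchor = div.find("a")
        # Skip sections with no anchor or no id instead of raising an error.
        if anchor is None or not anchor.get("id"):
            continue
        table = div.find_next("table", {"class": "properties"})
        if table is None:
            continue
        props = {}
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            # The header row uses <th>, so it yields no <td> cells.
            if len(cells) >= 2:
                key = cells[0].get_text(strip=True)
                props[key] = cells[1].get_text(strip=True)
        result[anchor["id"]] = props
    return result


result = parse_objects(page)
print(result)
```

Printing the skipped divs instead of silently continuing would show which sections of the real file differ from the posted example.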
