I'm trying to use BeautifulSoup to get information from HTML files. After narrowing down the 'soup' through soup.table.table.tbody.find_all('table', attrs={'cellspacing' : '0'})
, this is the kind of html I have to work with (I've removed some of the html to save space):
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME CITY</td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME ADDRESS</td></tr>
<tr><td>SOME ADDRESS 2</td></tr>
<tr><td>SOME CITY, STATE, ZIPCODE</td></tr>
<tr><td><a class="icon_arrow" href="http://SOMEURL" onclick="window.open('http://SOMEWEBSITE'); return false;" target="_blank">Visit website</a></td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME NAME </td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td nowrap="nowrap">SOME TELEPHONE</td></tr>
<tr><td><a class="icon_arrow" href="/mcs/iframecontactUsFormAction.do?toEmail=SOME@EMAIL.COM" onclick="window.open(%=contactUs%); return false;" target="_blank">E-mail Me</a></td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME CTIY</td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME ADDRESS</td></tr>
<tr><td>SOME ADDRESS2</td></tr>
<tr><td>SOME CITY, STATE, ZIPCODE</td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME NAME </td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td nowrap="nowrap">SOME TELEPHONE</td></tr>
<tr><td><a class="icon_arrow" href="/mcs/iframecontactUsFormAction.do?toEmail=SOME@EMAIL.COM" onclick="window.open(%=contactUs%); return false;" target="_blank">E-mail Me</a></td></tr>
</tbody>
</table>
</table>
The format for these pages is similar, some with more or less records. The information I am interested in extracting is SOME CITY, SOME ADDRESS, SOME ADDRESS2, SOME CITY, STATE, ZIPCODE, NAME, SOME TELEPHONE, and SOME@EMAIL.COM (though this can be skipped).
Looking at the html, it appears that all of the relevant information is between tags. I am just having a difficult time getting BS to find those tags to extract the information.