I'm trying to use BeautifulSoup to get information from HTML files. After narrowing down the soup with `soup.table.table.tbody.find_all('table', attrs={'cellspacing' : '0'})`, this is the kind of HTML I have to work with (I've removed some of it to save space):

<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME CITY</td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME ADDRESS</td></tr>
<tr><td>SOME ADDRESS 2</td></tr>
<tr><td>SOME CITY, STATE, ZIPCODE</td></tr>
<tr><td><a class="icon_arrow" href="http://SOMEURL" onclick="window.open('http://SOMEWEBSITE'); return false;" target="_blank">Visit website</a></td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME NAME </td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td nowrap="nowrap">SOME TELEPHONE</td></tr>
<tr><td><a class="icon_arrow" href="/mcs/iframecontactUsFormAction.do?toEmail=SOME@EMAIL.COM" onclick="window.open(%=contactUs%); return false;" target="_blank">E-mail Me</a></td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME CITY</td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME ADDRESS</td></tr>
<tr><td>SOME ADDRESS2</td></tr>
<tr><td>SOME CITY, STATE, ZIPCODE</td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td>SOME NAME </td></tr>
</tbody>
</table>, <table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr><td nowrap="nowrap">SOME TELEPHONE</td></tr>
<tr><td><a class="icon_arrow" href="/mcs/iframecontactUsFormAction.do?toEmail=SOME@EMAIL.COM" onclick="window.open(%=contactUs%); return false;" target="_blank">E-mail Me</a></td></tr>
</tbody>
</table>
</table>

The pages share a similar format, though some have more records and some have fewer. The information I am interested in extracting is SOME CITY, SOME ADDRESS, SOME ADDRESS2, SOME CITY, STATE, ZIPCODE, SOME NAME, SOME TELEPHONE, and SOME@EMAIL.COM (though this last one can be skipped).

Looking at the HTML, it appears that all of the relevant information is between `<td>` tags. I am just having a difficult time getting BeautifulSoup to find those tags so I can extract the information.

RobTheBank
  • Will `find_all("td")` help you? This will give you a list of all td tags. You may then use `get_text` to get the information between tags. Skip "E-mail Me" if you don't want the email address. – WKPlus Jan 03 '14 at 15:56
  • Just want to chime in with some alternatives worth trying: [lxml](http://lxml.de) (which can use BeautifulSoup as a parser through `lxml.html.soupparser`) and [pyquery](https://pypi.python.org/pypi/pyquery). I strongly recommend pyquery. Discussion on [this thread](http://stackoverflow.com/questions/1922032/parsing-html-in-python-lxml-or-beautifulsoup-which-of-these-is-better-for-wha) may also be of interest. – floer32 Jan 03 '14 at 16:04
  • @WKPlus, this returns results that look promising. The results are messy and will require some additional cleanup, but that's data, right? – RobTheBank Jan 03 '14 at 17:07

3 Answers

The usual way to find an exact part of an HTML document is by a specific tag name and attributes. When that is not possible, as in the HTML you shared, and the document structure is predictable, consider selecting by position instead, meaning `.find_all('tag name')[n]`, where `n` is the zero-based index of the element you want.

For example:

>>> soup.find_all('table')[1].tbody.find_all('td')[2]
<td>SOME CITY, STATE, ZIPCODE</td>
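Positional indexing raises `IndexError` as soon as a page has fewer tables or cells than expected, so it may help to wrap the lookup in a small guard. The following is only a sketch: `SAMPLE_HTML` is a trimmed-down stand-in for the HTML in the question, and `cell_text` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical trimmed-down sample modeled on the HTML in the question.
SAMPLE_HTML = """
<table><tbody><tr><td>SOME CITY</td></tr></tbody></table>
<table><tbody>
<tr><td>SOME ADDRESS</td></tr>
<tr><td>SOME ADDRESS 2</td></tr>
<tr><td>SOME CITY, STATE, ZIPCODE</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def cell_text(soup, table_index, cell_index):
    """Return the stripped text of one <td>, chosen by position.

    Returns None instead of raising when either index is past the end.
    """
    tables = soup.find_all("table")
    if table_index >= len(tables):
        return None
    cells = tables[table_index].find_all("td")
    if cell_index >= len(cells):
        return None
    return cells[cell_index].get_text(strip=True)


# Third cell of the second table, as in the example above.
city_state_zip = cell_text(soup, 1, 2)
print(city_state_zip)
```

On the real pages the tables are nested inside outer tables, so the indices would need to be worked out against the actual document.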
Guy Gavriely

You may use code like this:

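# Each element of the ResultSet is a Tag, so call find_all on it individually
tables = soup.table.table.tbody.find_all('table', attrs={'cellspacing' : '0'})
for ta in tables:
    tds = ta.find_all('td')
    for td in tds:
        text = td.get_text()
        # Skip the link labels, which carry no contact information
        if "E-mail Me" not in text and "Visit website" not in text:
            print(text)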
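If the flat list of strings turns out to be too messy, one possible next step is to group the tables into records. The sketch below assumes, based only on the sample in the question, that every record is exactly four consecutive tables (city, address, name, contact) and that the email address sits in the `toEmail` query parameter of the "E-mail Me" link. `SAMPLE_HTML`, `table_lines`, `table_email`, and the dictionary keys are hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup

# Hypothetical sample: one record, shaped like the HTML in the question.
SAMPLE_HTML = """
<table cellspacing="0"><tbody><tr><td>SOME CITY</td></tr></tbody></table>
<table cellspacing="0"><tbody>
<tr><td>SOME ADDRESS</td></tr>
<tr><td>SOME CITY, STATE, ZIPCODE</td></tr>
<tr><td><a href="http://SOMEURL">Visit website</a></td></tr>
</tbody></table>
<table cellspacing="0"><tbody><tr><td>SOME NAME </td></tr></tbody></table>
<table cellspacing="0"><tbody>
<tr><td nowrap="nowrap">SOME TELEPHONE</td></tr>
<tr><td><a href="/mcs/iframecontactUsFormAction.do?toEmail=SOME@EMAIL.COM">E-mail Me</a></td></tr>
</tbody></table>
"""

SKIP_LABELS = {"E-mail Me", "Visit website"}


def table_lines(table):
    """Stripped text of every <td> in a table, minus the link labels."""
    texts = (td.get_text(strip=True) for td in table.find_all("td"))
    return [t for t in texts if t and t not in SKIP_LABELS]


def table_email(table):
    """Email taken from the toEmail query parameter of a link, or None."""
    for link in table.find_all("a", href=True):
        emails = parse_qs(urlparse(link["href"]).query).get("toEmail")
        if emails:
            return emails[0]
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tables = soup.find_all("table", attrs={"cellspacing": "0"})

records = []
# Assumption: four tables per record; a trailing partial group is ignored.
for i in range(0, len(tables) - 3, 4):
    city, address, name, contact = tables[i:i + 4]
    records.append({
        "city": next(iter(table_lines(city)), None),
        "address": table_lines(address),
        "name": next(iter(table_lines(name)), None),
        "phone": next(iter(table_lines(contact)), None),
        "email": table_email(contact),
    })

for record in records:
    print(record)
```

If some pages break the four-table pattern, the grouping step is the part that would need to change; the two helpers work on any single table.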
WKPlus

Can you search for the individual tags by chaining another call onto the end of your existing code? I would also save the result to a variable (I'm pretty sure it comes back as a list). So something like:

    info = soup.table.table.tbody.find_all('table', attrs={'cellspacing' : '0'}).find_all('td')

Then to extract, just iterate over the list with get_text:

    for item in info:
        item.get_text()
Pat
  • Thanks for the input, Pat. It doesn't look like I can do a 'double find_all' in that format. The original find_all returns an object, ResultSet, that doesn't allow for another find_all. That is kind of the logic I need to follow, though. – RobTheBank Jan 03 '14 at 16:54
  • @RobTheBank ResultSet has no attribute `find_all`, but each element in ResultSet is a Tag type and has a method `find_all` – WKPlus Jan 03 '14 at 18:09
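
To make the point in the comments concrete: the "double `find_all`" idea can be kept by calling the second `find_all` on each Tag inside the ResultSet rather than on the ResultSet itself. This is a minimal sketch; `SAMPLE_HTML` is a hypothetical stand-in for the real page, and the variable names are for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page.
SAMPLE_HTML = """
<table cellspacing="0"><tbody><tr><td>SOME CITY</td></tr></tbody></table>
<table cellspacing="0"><tbody>
<tr><td>SOME ADDRESS</td></tr>
<tr><td>SOME ADDRESS 2</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a ResultSet, which behaves like a list of Tag objects.
tables = soup.find_all("table", attrs={"cellspacing": "0"})

# Run the second find_all on each Tag, then flatten the results into one list.
info = [td for table in tables for td in table.find_all("td")]

# get_text(strip=True) trims the surrounding whitespace from each cell.
texts = [td.get_text(strip=True) for td in info]
print(texts)
```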