How can I find a table after a text string using BeautifulSoup in Python?

Question

I am trying to extract data from several web pages which are not uniform in how they display their tables. I need to write code that will search for a text string and then go to the table immediately following that specific text string. Then I want to extract the contents of that table. Here's what I've got so far:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile('Table 1',re.IGNORECASE) # Also need to figure out how to ignore space
foundtext = soup.findAll('p',text=searchtext)
soupafter = foundtext.findAllNext()
table = soupafter.find('table') # find the next table after the search string is found
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
print

However, I get the following error:

    soupafter = foundtext.findAllNext()
AttributeError: 'ResultSet' object has no attribute 'findAllNext'

Is there an easy way to do this using BeautifulSoup?

score 10 · Answer 1 · edited May 23 '17 at 12:03

The error is due to the fact that findAllNext is a method of Tag objects, but foundtext is a ResultSet object, which is a list of matching tags or strings. You could iterate through the each of the tags in foundtext, but depending on your needs it might be sufficient to use find, which returns only the first matching tag.

Here's a modified version of your code. After changing foundtext to use soup.find, I found and fixed the same problem with table. I modified your regex to ignore whitespace between the words:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

This outputs:

1. row 1, cell 1| 1. row 1, cell 2|
1. row 2, cell 1| 1. row 2, cell 2|

How can I find a table after a text string using BeautifulSoup in Python?

1 Answers1

Linked