I am learning Python and BeautifulSoup, and I saw this code online:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

While everything else is clear, I could not understand how the join works in this line:

    text = ''.join(td.find(text=True))

I tried searching the BS documentation for join but couldn't find anything, and I couldn't find help online on how join is used in BS either.

Please let me know how that line works. Thanks!

PS: The above code is from another Stack Overflow question, "How can I find a table after a text string using BeautifulSoup in Python?". It's not my homework :)

user1644208

3 Answers

''.join() is a Python string method, not anything BS-specific. It lets you join a sequence of strings, using the string it is called on as the separator:

>>> '-'.join(map(str, range(3)))
'0-1-2'
>>> ' and '.join(('bangers', 'mash'))
'bangers and mash'

'' is simply the empty string. Using it as the separator concatenates a whole set of strings into one large string with nothing in between:

>>> ''.join(('5', '4', 'apple', 'pie'))
'54applepie'

In the specific case of your example, the intent is to collect all text contained in the <td> element, including text inside nested HTML elements such as <b>, <i> or <a href="">, and put it together into one long string. Note that td.find(text=True) returns only the first string in the element; td.findAll(text=True) is the call that returns a sequence of all the strings, which ''.join() then joins together into one long string.

Martijn Pieters

join isn't part of BeautifulSoup; it is a built-in method of strings in Python. It joins a sequence of strings together, using the given string as the separator; e.g., '+'.join(['a', 'b', 'c']) returns 'a+b+c'. See the documentation.

BrenBarn

The code is incorrect. This line:

text = ''.join(td.find(text=True))

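uses find, which returns only the first string child of the td tag, and then calls join on that single string. It runs without error only because ''.join() iterates over the characters of that first string and joins them back together, creating a copy. Any text in later children is ignored.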

So this:

<td>foo<b>bar</b></td>

just runs ''.join("foo").
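
To make that concrete, here is a small sketch in plain Python 3 (no BeautifulSoup needed) of what join does when it is handed a single string rather than a list of strings. The variable names are only illustrative.

```python
# A single string is itself an iterable of one-character strings,
# so join walks over its characters.
first_child = "foo"

copy = "".join(first_child)      # characters glued back together
dashed = "-".join(first_child)   # the separator makes the iteration visible

# Joining a real sequence of strings is what the original code intended.
all_children = ["foo", "bar"]
combined = "".join(all_children)

print(copy)      # foo
print(dashed)    # f-o-o
print(combined)  # foobar
```

With an empty separator the character-by-character join is invisible, which is why the original line appears to work.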

Instead, use the td.text property. It automatically finds all strings in the td and joins them.

text = td.text
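
As a rough illustration of the difference, the sketch below assumes Python 3 with the newer bs4 package installed (the question itself uses the older BeautifulSoup 3 import). It also assumes a bs4 version that accepts the string= argument, which is the current name for text=, and it uses find_all and get_text, the bs4 spellings of findAll and the text property.

```python
from bs4 import BeautifulSoup

# Parse the small example cell with the standard-library parser.
soup = BeautifulSoup("<td>foo<b>bar</b></td>", "html.parser")
td = soup.find("td")

# find returns only the first matching string inside the tag.
first_string = td.find(string=True)

# find_all returns every string inside the tag, nested tags included.
all_strings = td.find_all(string=True)

joined_first = "".join(first_string)  # joins the characters of "foo"
joined_all = "".join(all_strings)     # joins "foo" and "bar"
full_text = td.get_text()             # same result without an explicit join

print(joined_first)  # foo
print(joined_all)    # foobar
print(full_text)     # foobar
```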
Aaron DeVore