How does join work in python beautifulsoup

Question

I am learning python and beautifulsoup, and saw this code online:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

html = ['<html><body><p align="center"><b><font size="2">Table 1</font></b><table><tr><td>1. row 1, cell 1</td><td>1. row 1, cell 2</td></tr><tr><td>1. row 2, cell 1</td><td>1. row 2, cell 2</td></tr></table><p align="center"><b><font size="2">Table 2</font></b><table><tr><td>2. row 1, cell 1</td><td>2. row 1, cell 2</td></tr><tr><td>2. row 2, cell 1</td><td>2. row 2, cell 2</td></tr></table></html>']
soup = BeautifulSoup(''.join(html))
searchtext = re.compile(r'Table\s+1',re.IGNORECASE)
foundtext = soup.find('p',text=searchtext) # Find the first <p> tag with the search text
table = foundtext.findNext('table') # Find the first <table> tag that follows it
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        try:
            text = ''.join(td.find(text=True))
        except Exception:
            text = ""
        print text+"|",
    print

While everything else is clear, I could not understand how the join works in this line:

    text = ''.join(td.find(text=True))

I tried searching the BS documentation for join but couldn't find anything, and I couldn't find help online on how join is used in BS either.

Please let me know how that line works. thanks!

PS: the above code is from another Stack Overflow question, it's not my homework :) That question is "How can I find a table after a text string using BeautifulSoup in Python?"

Martijn Pieters · Accepted Answer · 2012-09-03T19:59:47.413

score 6

''.join() is a Python string method, not anything BS-specific. It lets you join a sequence of strings, using the string it is called on as the separator:

>>> '-'.join(map(str, range(3)))
'0-1-2'
>>> ' and '.join(('bangers', 'mash'))
'bangers and mash'

'' is simply the empty string; joining on it concatenates a whole set of strings into one large one with nothing in between:

>>> ''.join(('5', '4', 'apple', 'pie'))
'54applepie'

In the specific case of your example, the statement is meant to collect the text contained in the <td> element, including text inside any nested HTML elements such as <b>, <i> or <a href="">, and put it all together into one long string. Note that td.find(text=True) returns only the first string it finds; td.findAll(text=True) is the call that returns the whole sequence of Python strings, which ''.join() then joins together into one long string.

edited Sep 03 '12 at 19:59

answered Sep 03 '12 at 19:54

Martijn Pieters


@martijin - thanks for the explanation, i think i got it now! – user1644208 Sep 03 '12 at 20:04
done! i couldnt mark the answer because of the time delay imposed by SO for people with low reps :D anyway, thanks again! – user1644208 Sep 03 '12 at 20:19

score 0 · Answer 2 · answered Sep 03 '12 at 19:54


Join isn't part of BeautifulSoup; it is a built-in method of strings in Python. It joins a sequence of strings together, using the string it is called on as the separator; e.g., '+'.join(['a', 'b', 'c']) is 'a+b+c'. See the documentation.
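One detail worth adding: every element passed to join must already be a string. The short sketch below (written for Python 3; the variable names are only illustrative) shows what happens with integers and two common ways to convert them first.

```python
# join() accepts any iterable, but every element must already be a str.
numbers = [0, 1, 2]

try:
    "-".join(numbers)          # ints are rejected
    error = None
except TypeError as exc:
    error = exc                # keep a reference; exc is unbound after the block

# Convert each element first, with map() or a generator expression.
joined_with_map = "-".join(map(str, numbers))
joined_with_genexp = "-".join(str(n) for n in numbers)

print(type(error).__name__)    # TypeError
print(joined_with_map)         # 0-1-2
print(joined_with_genexp)      # 0-1-2
```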

answered Sep 03 '12 at 19:54

BrenBarn


score 0 · Answer 3 · answered Sep 05 '12 at 05:31

The code is incorrect. This line:

text = ''.join(td.find(text=True))

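uses find, which returns only the first string child of the td tag, and then calls join on that single string. It runs without error only because a string is itself iterable: ''.join() iterates over the characters of that first string and glues them back together, producing a copy of it. Any text in later children is silently dropped.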

So this:

<td>foo<b>bar</b></td>

just runs ''.join("foo").
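To see why that call does nothing useful, here is a small sketch in plain Python 3 with no BeautifulSoup involved. The variable first_string is only a stand-in for whatever td.find(text=True) would return.

```python
first_string = "foo"   # stand-in for the single string returned by find

# A str is iterable, so join() sees the sequence 'f', 'o', 'o'.
characters = list(first_string)

copy = "".join(first_string)     # characters glued back together
dashed = "-".join(first_string)  # a visible separator exposes the iteration

print(characters)  # ['f', 'o', 'o']
print(copy)        # foo
print(dashed)      # f-o-o
```

With an empty separator the result is indistinguishable from the input, which is why the original code appears to work.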

Instead, use the td.text property. It automatically finds all strings in the td and joins them.

text = td.text
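As a rough illustration of "find all strings and join them", the sketch below uses only the standard library's html.parser as a stand-in for BeautifulSoup. TextCollector and all_text are hypothetical helper names, and this is not how BeautifulSoup implements the text property; it only shows the same idea in Python 3.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collect every text node seen while parsing."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.parts.append(data)


def all_text(html):
    """Return all text in the fragment, joined with the empty string."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


cell = "<td>foo<b>bar</b></td>"
print(all_text(cell))  # foobar
```

Here the text inside the nested <b> tag is kept, whereas taking only the first string would give just foo.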
