BeautifulSoup parsing table and filter second rows

Question

I followed this and want to filter out everything after the br tag.

Here is an example:

<td class="flightAirport first">Palma de Mallorca<br><span class="second_row">nach Berlin Tegel</span></td>

What I get is: >Palma de Mallorcanach Berlin Tegel<

What I tried:
Stripping 'nach Berlin Tegel' from the string with strip() gives a string with missing characters, like >Palma de Mallor<.

My question is: how can I drop the second line in the first place, so that I don't have to deal with strip()?

Thx.

Edit: replace() gives the right result. But if it's possible to filter it in the first place, it would be great to know how.

score 0 · Answer 1 · answered Jun 02 '18 at 22:25

BeautifulSoup elements have an attribute called stripped_strings, a generator that yields every string found inside the element with surrounding whitespace removed. The first string it yields is the text before the br tag. Adapting the code you referenced to parse your sample HTML:

from bs4 import BeautifulSoup

sample_table = """
<table>
  <tr>
    <td class="flightAirport first">Palma de Mallorca<br><span class="second_row">nach Berlin Tegel</span></td>
    <td class="flightAirport first">LAX</td>
    <td class="flightAirport first"></td>
  </tr>
</table>"""

data = []
soup = BeautifulSoup(sample_table, 'html.parser')
table = soup.find("table")
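for row in table.find_all("tr"):
    # next(..., None) takes the first stripped string of each cell (the text
    # before <br>) and gives None for empty cells instead of raising StopIteration
    cols = [next(e.stripped_strings, None) for e in row.find_all('td')]
    data.append([e for e in cols if e])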

print(data)
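
As for why strip() ate part of the name: str.strip(chars) treats its argument as a set of characters to remove from both ends, not as a substring, so it keeps going into the trailing letters of 'Mallorca' that also occur in 'nach Berlin Tegel'. If you ever do need to cut a known suffix from an already-joined string, here is a minimal sketch (drop_suffix is a made-up helper name for illustration, not part of BeautifulSoup):

```python
def drop_suffix(text, suffix):
    """Return text without the trailing suffix; unchanged if it does not end with it."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

combined = "Palma de Mallorcanach Berlin Tegel"
second_row = "nach Berlin Tegel"

# strip() removes any of the characters in second_row from both ends,
# so it also eats trailing letters of "Mallorca".
mangled = combined.strip(second_row)

# Cutting the exact suffix keeps the airport name intact.
clean = drop_suffix(combined, second_row)

print(repr(mangled))
print(repr(clean))
```

This is also why replace() worked for you: it matches the whole substring. Taking the first item of stripped_strings avoids the problem altogether, because the two lines are never joined.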
