Unable to parse names properly from some elements

Question

I've written a script in Python to parse some names out of some elements. When I run it, it does parse the names, but the output is hard to read: the names run together so that they look like two big names. In the markup, the names are separated by br tags. How can I get each name individually?

Elements within which the names are:

html_content='''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DaVita HealthCare Partners<br>Delphi Automotive<br>Denso<br>Dentsply International<br>Deutsche Boerse<br>Deutsche Post<br>Deutsche Telekom<br>Diageo<br>Dialight<br>Digital Realty Trust<br>Donaldson Company<br>DSM<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>EDP Renováveis<br>Edwards Lifesciences<br>Elekta<br>EnerNOC<br>Enphase Energy<br>Essilor<br>Etsy<br>Eurazeo<br>European Investment Bank (EIB)<br>Evonik Industries<br>Express Scripts&nbsp;<br><br><strong>F<br></strong>Fielmann<br>First Solar<br>FMO<br>Ford Motor<br>Fresenius Medical Care<br><br></p></div></div>
'''

The script I've written to parse names:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content,"lxml")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)

Output I'm getting (partial result):

DDaiwa House IndustryDanske BankDaVita HealthCare PartnersDelphi AutomotiveDensoDentsply InternationalDeutsche

Output I want to get:

DDaiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International

FYI, when I take a closer look at the result, I can see that the separate names are attached to each other with no gap in between.

score 2 · Answer 1 · answered Nov 08 '17 at 12:50

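Using item.text strips out all the tags, including the <br> tags that separate the names, so the names end up glued together. You need to replace each <br> tag with '\n' first. Following the answer provided by Ian Mackinnon for the question "Convert </br> to end line",

your script should be: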

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content,"lxml")

for br in soup.find_all("br"):
    br.replace_with("\n")

for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)

and the output:

 D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith  E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 

F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
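
The output above still carries stray spaces and blank lines. As a minimal sketch of the same br-replacement idea, the lines can be split, stripped, and collected into a list. The markup here is a shortened copy of the one in the question, and `html.parser` is used instead of `lxml` only to avoid the extra dependency; both are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (an assumption for brevity).
html_content = '''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DS Smith </p>
<p><strong>E<br></strong>eBay<br>Express Scripts&nbsp;<br><br><strong>F<br></strong>Fielmann<br><br></p></div></div>
'''

soup = BeautifulSoup(html_content, "html.parser")

# Same idea as in the answer: turn every <br> into a newline text node.
for br in soup.find_all("br"):
    br.replace_with("\n")

names = []
for p in soup.select(".second-child p"):
    for line in p.get_text().splitlines():
        line = line.strip()  # also removes the non-breaking space left by &nbsp;
        if line:             # skip the blank lines produced by doubled <br> tags
            names.append(line)

print(names)
```

The single-letter headings (D, E, F) stay in the list because they are ordinary text nodes inside the strong tags; filtering them out would be a separate step.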

Thanks Amjad Gd for your solution. It worked as well. – SIM Nov 08 '17 at 13:39

score 1 · Accepted Answer · answered Nov 08 '17 at 13:03

Check the solution below and let me know if any improvements are required:

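from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for items in soup.select(".second-child"):
    for text_nodes in items.select("p"):
        # .strings yields each text node separately, so they can be joined with newlines
        name = " \n".join([item for item in text_nodes.strings if item])
        print(name)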

Output

D 
Daiwa House Industry 
Danske Bank 
DaVita HealthCare Partners 
Delphi Automotive 
Denso 
Dentsply International 
Deutsche Boerse 
Deutsche Post 
Deutsche Telekom 
Diageo 
Dialight 
Digital Realty Trust 
Donaldson Company 
DSM 
DS Smith 
E 
East Japan Railway Company 
eBay 
EDP Renováveis 
Edwards Lifesciences 
Elekta 
EnerNOC 
Enphase Energy 
Essilor 
Etsy 
Eurazeo 
European Investment Bank (EIB) 
Evonik Industries 
Express Scripts  
F 
Fielmann 
First Solar 
FMO 
Ford Motor 
Fresenius Medical Care

answered Nov 08 '17 at 13:03

Andersson

Thanks sir Andersson. It worked just great. No improvement is required. Thanks again. – SIM Nov 08 '17 at 13:40
You have just driven me crazy, sir. Every time, you come up with newer methods while providing solutions. Could you please tell me in short what `strings` does here that `text` couldn't? Thanks again, sir. – SIM Nov 08 '17 at 14:03
For example, for the `soup.select(".second-child")[0]` node, the `text` property returns the text as one long single string, while the `strings` property returns a generator over all descendant text nodes. – Andersson Nov 08 '17 at 14:08
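
To make the difference described in the comment above concrete, here is a small sketch comparing `text`, `strings`, and the related `stripped_strings` property on a single paragraph. The one-line markup is a cut-down stand-in for the question's HTML, and `html.parser` is used only to keep the example dependency-free; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the markup in the question (an assumption for brevity).
html = "<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DS Smith </p>"
p = BeautifulSoup(html, "html.parser").p

# .text concatenates every descendant text node into one string,
# so the <br> boundaries between names are lost.
joined = p.text

# .strings yields each descendant text node separately, whitespace included.
pieces = list(p.strings)

# .stripped_strings yields the same nodes with surrounding whitespace removed
# and whitespace-only nodes skipped.
clean = list(p.stripped_strings)

print(repr(joined))
print(pieces)
print(clean)
```

Because each name sits in its own text node between br tags, iterating over `strings` keeps the names apart without modifying the tree first.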
