
Using the HTML below, I would like to pull two pieces of data out and add them to a list in Python. Each bold text is a horse name, and the text following it is the comment on that horse.

<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.
  <br>
  <br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.
  She saw it out well and it´ll be interesting to see how she copes with a rise.
  <br>
  <br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.
  <br>
  <br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.
  <br>
  <br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]
  <br>
  <br>
  <div id="resultRaceReport" class="hide"></div>
</div>

From the above, I would like the output to look like the following:

[LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.]

[Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.]

[Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.]

[Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]]

But I'm just not sure how to get the desired output (I'm mainly after the logic behind it).

I currently use lxml to scrape content and need to match the bold text (the horse's name) against my table so I can add the comments (the text after the bold) to my database.

Padraic Cunningham
emma perkins

2 Answers

Using lxml:

h = """<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.<br><br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.<br><br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.<br><br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.<br><br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]<br><br> <div id="resultRaceReport" class="hide"></div></div>"""

from lxml import html

x = html.fromstring(h)

div = x.xpath("//*[@id='ANALYSIS']")[0]

# find bold tags by class name
for b in div.xpath(".//b[@class='black']"):
    # get bold text
    print(b.text)
    # get the first text node after the bold tag, i.e. the comment up to
    # the next br tag (xpath() returns it wrapped in a list)
    print(b.xpath("./following::text()[1]"))

This would give you:

LADY MAKFI
[u' showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.']
Weardiditallgorong
[' went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.']
Chauvelin
[', in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.']
Happy Jack
[' not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']

If you want it all in a single list exactly as posted:

from lxml import html

x = html.fromstring(h)
div = x.xpath("//*[@id='ANALYSIS']")[0]
out = [b.text + "," +  b.xpath("./following::text()[1]")[0].lstrip(",") for b in div.xpath(".//b[@class='black']")]

Which gives you:

[u'LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.',
 'Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.',
 'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.',
 'Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
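
Since the goal in the question is to match each name against a table, keeping the name and the comment as separate values may be more convenient than one joined string. The following is only a sketch under a few assumptions: it uses a shortened, two-horse copy of the markup (`SNIPPET` and the function name `horse_comments` are made up for illustration), it reads the comment from `b.tail` rather than the `following::text()[1]` XPath used above, and it collapses the newlines and indentation that appear inside the comment text in the original page.

```python
from lxml import html

# Shortened, illustrative copy of the markup from the question.
SNIPPET = """<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.
  <br>
  <br> <b class="black">LADY MAKFI</b> showed vastly improved form.
  She saw it out well.
  <br>
  <br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort.
  <br>
  <div id="resultRaceReport" class="hide"></div>
</div>"""


def horse_comments(markup):
    """Return a list of (name, comment) tuples, one per bold horse name."""
    root = html.fromstring(markup)
    pairs = []
    for b in root.xpath("//*[@id='ANALYSIS']//b[@class='black']"):
        name = (b.text or "").strip()
        # b.tail is the text between </b> and the next tag (here a <br>).
        tail = b.tail or ""
        # Drop a leading comma/space, then collapse newlines and space runs.
        comment = " ".join(tail.lstrip(", ").split())
        pairs.append((name, comment))
    return pairs


pairs = horse_comments(SNIPPET)
for name, comment in pairs:
    print(name, "->", comment)
```

Each tuple can then be used directly as the parameters of a database update keyed on the horse name, provided the names in the table are spelled and cased the same way as on the page.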
Padraic Cunningham

I prefer Beautiful Soup's API over using lxml directly. I can avoid XPath entirely and just write Python.

import bs4

# document is assumed to hold the HTML string from the question
soup = bs4.BeautifulSoup(document, 'lxml')
[b.text + b.next_sibling.rstrip() for b in soup.find_all('b')]

Output:

['LADY MAKFI showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.\n  She saw it out well and it´ll be interesting to see how she copes with a rise.',
 'Weardiditallgorong went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.',
 'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.',
 'Happy Jack not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
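
The first item above still carries the newline and indentation from the source HTML, and the name and comment are fused into one string. A possible variation, offered as a sketch rather than a definitive version, builds a dictionary from horse name to comment instead. It assumes a shortened copy of the markup (`SNIPPET` is a made-up name), uses Python's built-in `html.parser` backend so that lxml is not required, and guards against a `<b>` tag whose next sibling is not plain text.

```python
import bs4

# Shortened, illustrative copy of the markup from the question.
SNIPPET = """<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.
  <br>
  <br> <b class="black">LADY MAKFI</b> showed vastly improved form.
  She saw it out well.
  <br>
  <br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort.
  <br>
  <div id="resultRaceReport" class="hide"></div>
</div>"""

soup = bs4.BeautifulSoup(SNIPPET, "html.parser")

comments = {}
for b in soup.find_all("b", class_="black"):
    sibling = b.next_sibling
    # next_sibling can be None or another tag; only use it when it is text.
    text = str(sibling) if isinstance(sibling, bs4.NavigableString) else ""
    # Collapse newlines and space runs, then drop a leading comma/space.
    comments[b.get_text(strip=True)] = " ".join(text.split()).lstrip(", ")

print(comments.get("Chauvelin"))
```

A dictionary makes the lookup against an existing table a single `comments.get(name)` call, although two horses with the same name in one report would overwrite each other.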
Håken Lid