
I'm trying to scrape the title from the following HTML:

<FONT COLOR=#5FA505><B>Claim:</B></FONT> &nbsp; Coed makes unintentionally risqu&eacute; remark about professor's "little quizzies."
<BR><BR>
<CENTER><IMG SRC="/images/content-divider.gif"></CENTER>

I'm using this code:

def parse_article(self, response):
    for href in response.xpath('//font[b = "Claim:"]/following-sibling::text()'):
        print(href.extract())

and I successfully pull the correct Claim: value from the HTML above, but it also pulls the HTML below (among other blocks with a similar structure on the same page). My xpath() only targets the font tag containing Claim:, so why is it pulling in Origins as well, and how can I fix it? I tried to get only the next following-sibling instead of all of them, but that didn't work.

<FONT COLOR=#5FA505 FACE=""><B>Origins:</B></FONT> &nbsp; Print references to the "little quizzies" tale date to 1962, but the tale itself has been around since the early 1950s. It continues to surface among college students to this day. Similar to a number of other college legends
Rafa

2 Answers


I think your XPath is missing the text() qualifier. It should be:

'//font[b/text()="Claim:"]/following-sibling::text()'
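
To see what the text() qualifier changes in the predicate, here is a minimal sketch. It assumes lxml is installed, and the two one-line fragments are made up for illustration rather than taken from the real page. `b = "Claim:"` compares the whole string value of the b element, while `b/text() = "Claim:"` compares its individual text nodes, so the two only diverge when b contains nested markup.

```python
from lxml import html

# Hypothetical fragments: one with plain text inside <b>, one with nested markup.
plain = html.fromstring('<div><font><b>Claim:</b></font> first</div>')
nested = html.fromstring('<div><font><b>Cla<i>im:</i></b></font> second</div>')

by_string_value = '//font[b = "Claim:"]'       # compares the string value of <b>
by_text_node = '//font[b/text() = "Claim:"]'   # compares each text node child of <b>

# Count how many <font> elements each predicate selects in each fragment.
plain_counts = (len(plain.xpath(by_string_value)), len(plain.xpath(by_text_node)))
nested_counts = (len(nested.xpath(by_string_value)), len(nested.xpath(by_text_node)))

print("plain <b>: ", plain_counts)
print("nested <b>:", nested_counts)
```

For markup like the question's, where b holds only the text Claim:, both predicates should select the same font element, so the following-sibling part of the expression still decides how many text nodes come back.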
Łukasz

The following-sibling axis returns every sibling that follows the context node, not just the next one, which is why the text after Origins: is returned as well. If you only want the first sibling, try the XPath expression:

//font[b = "Claim:"]/following-sibling::text()[1]

Or, depending on your exact use case:

(//font[b = "Claim:"]/following-sibling::text())[1]
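
The difference between the two forms can be sketched as follows, assuming lxml is available. The page below is a simplified, made-up stand-in with two Claim:/Origins: blocks, and `clean` is a hypothetical helper that drops whitespace-only text nodes.

```python
from lxml import html

# Simplified stand-in for a page with two Claim/Origins blocks (not the real page).
PAGE = """
<html><body>
<div>
<font><b>Claim:</b></font> first claim text
<br>
<font><b>Origins:</b></font> first origins text
</div>
<div>
<font><b>Claim:</b></font> second claim text
<br>
<font><b>Origins:</b></font> second origins text
</div>
</body></html>
"""

tree = html.fromstring(PAGE)
BASE = '//font[b = "Claim:"]/following-sibling::text()'


def clean(nodes):
    """Hypothetical helper: strip each text node and drop empty ones."""
    return [n.strip() for n in nodes if n.strip()]


# Every text node after each Claim: font, including the Origins: text.
every_sibling = clean(tree.xpath(BASE))

# [1] applies per matched font: the first following text node of each one.
first_per_match = clean(tree.xpath(BASE + '[1]'))

# Parentheses apply [1] to the combined result: a single node for the whole page.
first_overall = clean(tree.xpath('(' + BASE + ')[1]'))

print(every_sibling)
print(first_per_match)
print(first_overall)
```

If the page has several Claim: blocks and you want one value per block, the unparenthesised form is probably the one you want; the parenthesised form is for when only the very first match on the page matters.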
nwellnhof