-1

This is the HTML that I want to find:

<a href="/audio-books/type/computer/page/2/">»</a>

The problem is the » character, which appears in the page source as

&raquo;

I have tried:

response.xpath('//div[@class="wp-pagenavi"]/a[@title="»"]')

and

response.xpath('//div[@class="wp-pagenavi"]/a[@title="&raquo;"]')

but neither works.

Is there a way to match a value in XPath when that value is a character entity or an extended character?

I am trying to find the link to the next page. For the numbered links I can use

response.xpath('//div[@class="wp-pagenavi"]/a[@title="2"]')

and this works fine.

WebOrCode
  • While not what you asked, by far the less painful and more accurate selector would use the URI in the `href` rather than what is essentially a presentation issue looking for the guillemet; so: `//a[contains(@href, "/page/")]/@href` (assuming you wanted the actual `href`; omit that `/@href` to just get the target `a` tag) – mdaniel Dec 29 '17 at 03:05

2 Answers

2

First of all, your XPath is incorrect because it matches on the title attribute. The character is in the element's text(), not in title. This XPath should work:

response.xpath(u'//a[./text()="\xbb"]')
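
To see why matching the text against `\xbb` works: the parser decodes `&raquo;` into the single character U+00BB before any XPath is evaluated. Below is a minimal sketch of that, using the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selector (an assumption made so the snippet runs without Scrapy; ElementTree supports only a subset of XPath). The markup is a hypothetical reduction of the page in the question, and it uses `&#187;` because plain XML does not define `&raquo;`.

```python
import html
import xml.etree.ElementTree as ET

# The named and the decimal entity both decode to the same character, U+00BB.
decoded_named = html.unescape("&raquo;")
decoded_decimal = html.unescape("&#187;")

# Hypothetical markup modelled on the question. XML only knows numeric
# character references, so &#187; stands in for &raquo; here.
MARKUP = (
    '<div class="wp-pagenavi">'
    '<a title="2" href="/audio-books/type/computer/page/2/">2</a>'
    '<a href="/audio-books/type/computer/page/2/">&#187;</a>'
    '</div>'
)

root = ET.fromstring(MARKUP)

# Match on the link text: the parsed tree holds the decoded character.
by_text = root.findall('.//a[.="\xbb"]')

# Matching on @title finds nothing, because the "»" link has no title.
by_title = root.findall('.//a[@title="\xbb"]')

# Entities are not decoded inside the XPath string itself, so this
# compares the text against the seven literal characters "&#187;".
by_entity = root.findall('.//a[.="&#187;"]')

next_href = by_text[0].get("href")
print(next_href, len(by_title), len(by_entity))
```

In Scrapy the same idea is the `response.xpath(...)` call above: put the real character (or its `\xbb` escape) in the expression and compare it with the text of the `a` element.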
eLRuLL
  • Foremost, thank you for clarifying that it is the `text()` one should match, not `@title` that was used everywhere else in this question. As a tiny bit of pedantry, one need not qualify the `text()` with `./` as it is implied to be the `a` by the use of the array-brackets. To be extra cautious, one could say `a[string(.)="\xbb"]` to side-step if they wrap the guillemet in a `` or such in the future (whitespace concerns aside, cause I have finite characters here :-)) – mdaniel Dec 29 '17 at 03:15
  • @eLRuLL thank you. I have not even noticed that this HTML tag does not have a title, the problem was because other ones had. – WebOrCode Dec 29 '17 at 09:35
0

I haven't tried to run it, but you should use the decimal entity to find extended characters via XPath.

For &raquo; you should use &#187;, so your XPath should look like

div[@class="wp-pagenavi"]/a[@title="&#187;"]

See the complete chart here for reference.

If that does not work, you can use the Unicode character for &raquo; instead. You can also see this post. Hope this helps.
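
As a rough sketch of the Unicode fallback: Python's standard `html.unescape` turns either entity form into the actual character, which can then be placed in the XPath string. The helper name `xpath_for_link_text` is hypothetical, and it matches on `text()` rather than `@title` on the assumption that the character sits in the link text; it also does no escaping, so it is only suitable for characters other than `"`.

```python
import html


def xpath_for_link_text(entity: str) -> str:
    """Build an XPath matching pager links whose text is the decoded entity.

    Hypothetical helper: accepts "&raquo;", "&#187;" or the character itself.
    """
    char = html.unescape(entity)  # "&raquo;" and "&#187;" both become "\xbb"
    return f'//div[@class="wp-pagenavi"]/a[text()="{char}"]'


query = xpath_for_link_text("&raquo;")
print(query)
```

The resulting string could then be passed to `response.xpath(query)`.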

Muhammad Omer Aslam