Given this html:

<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>

How can I use XPath to get the following result:

[
    'This is a link',
    'This is another link.'
]

What I've tried:

//ul/li/text()

But this gives me ['This is ', 'This is .'] (without the text in the a tags).

Also:

string(//ul/li)

But this gives me ['This is a link'] (so only the first element)

Also

//ul/li/descendant-or-self::text()

But this gives me ['This is ', 'a link', 'This is ', 'another link', '.']

Any further ideas?

Tomalak
Peter

2 Answers

XPath generally cannot select what is not there. These things do not exist in your HTML:

[
    'This is a link',
    'This is another link.'
]

They might exist conceptually on the higher abstraction level that is the browser's rendering of the source code, but strictly speaking even there they are separate, for example in color and functionality.

On the DOM level there are only separate text nodes and that's all XPath can pick up for you.

Therefore you have three options.

  1. Select the text() nodes and join their individual values in Python code.
  2. Select the <li> elements and, for each of them, evaluate string(.) or normalize-space(.) with Scrapy. normalize-space() also trims and collapses whitespace the way you would expect.
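  3. Select the <li> elements and read their text content in Python, for example with ''.join(el.itertext()) on lxml elements, which finds all descendant text nodes and joins them for you. (The plain .text property of an lxml element only covers the text before the first child element.)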

Personally I would go for the last option, with //ul/li as my basic XPath expression, as this would result in a cleaner solution.
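
As a rough sketch of joining the text nodes in Python, the snippet below collects the descendant text of each <li>. It uses the standard library's xml.etree.ElementTree rather than Scrapy or lxml, which is an assumption made only to keep the example self-contained (it works here because the sample markup happens to be well-formed XML). lxml elements offer itertext() under the same name. The variable names li_texts and first_only are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; it is well-formed, so an XML parser accepts it.
HTML = """<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>"""

root = ET.fromstring(HTML)  # root is the <ul> element

# itertext() yields every descendant text node in document order;
# joining them reproduces the XPath string value of each <li>.
li_texts = ["".join(li.itertext()) for li in root.findall("li")]
print(li_texts)

# With ElementTree (and lxml) elements, .text on its own holds only the
# text that precedes the first child element.
first_only = [li.text for li in root.findall("li")]
print(first_only)
```

The join happens once per <li>, so each list entry corresponds to exactly one list item regardless of how many nested elements it contains.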


As @paul points out in the comments, Scrapy offers a nice fluent interface to do multiple processing steps in one line of code. The following code implements variant #2:

selector = scrapy.Selector(text='''<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>''')

selector.css('ul > li').xpath('normalize-space()').extract()
# --> [u'This is a link', u'This is another link.']
Tomalak
  • Can you elaborate on accessing the `.text` property? I know a DOM element has a text property in HTML, but for XPath, this is just plain XML and there is no DOM, no? – Peter Dec 12 '16 at 14:07
  • As far as I have understood it from the docs, Scrapy returns lxml objects from your XPath operations. And they support `.text`. – Tomalak Dec 12 '16 at 14:08
  • Ah ok, I get what you mean. Thanks. – Peter Dec 12 '16 at 14:12
  • The shortcoming with XPath here is not so much that the desired strings don't exist in the HTML as it is that XPath 1.0 can't apply the string value function individually to the results of the targeted XPath expression. If Scrapy used XPath 2.0, [OP's goal could be met entirely within XPath](http://stackoverflow.com/a/41103378/290085). – kjhughes Dec 12 '16 at 14:48
  • I was hesitant to post a solution that would not work in the context of Scrapy. But in order to provide the context of the restriction it's not a bad thing, thanks for the update. – Tomalak Dec 12 '16 at 15:11
  • @Peter There's actually a third option of doing it in Scrapy, see updated answer. – Tomalak Dec 12 '16 at 15:28
  • 2) is rather straightforward with scrapy selectors where you can chain XPath and CSS: http://pastebin.com/tWyDZswv – paul trmbrth Dec 12 '16 at 15:44
  • @paultrmbrth Nice addition, you should probably post that as an answer of your own (otherwise I would add it to my answer if you don't mind) – Tomalak Dec 12 '16 at 15:46
  • You can add it to your answer @Tomalak, that's fine – paul trmbrth Dec 12 '16 at 15:47
  • @paul Thanks, that's done. Am I right to assume that `selector.xpath('//ul/li').xpath('normalize-space()')` would work as well? – Tomalak Dec 12 '16 at 16:03
  • Yes, it works too. You are right, I could have focused on XPath alone as the OP was asking about it only. – paul trmbrth Dec 12 '16 at 16:05
  • Many ways to skin a cat, as they say. – Tomalak Dec 12 '16 at 16:48

@Tomalak is correct in saying that XPath generally cannot select that which is not there.

However, in this case, the results you want are the string values of li elements. As you've found,

string(//ul/li)

gets you close but only returns the first desired string.

This points to a shortcoming in XPath 1.0 that was addressed in XPath 2.0.

In XPath 1.0, you have to iterate over the nodeset selected by //ul/li outside of XPath -- in XSLT, Python, Java, etc.
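
A minimal sketch of that outside-of-XPath iteration in Python follows. It assumes the standard library's xml.etree.ElementTree as a stand-in for whatever XPath 1.0 host is in use, and the sample markup is the question's snippet with extra line breaks added to show the whitespace handling. string_value and normalize_space are hypothetical helper names that only approximate the XPath functions string() and normalize-space(); Python's str.split() treats a few more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# The question's markup, with extra whitespace in the second item (an
# illustrative variation, not part of the original snippet).
HTML = """<ul>
    <li>This is <a href="#">a link</a></li>
    <li>
        This is
        <a href="#">another link</a>.
    </li>
</ul>"""


def string_value(element):
    """Approximate XPath string(.): concatenate all descendant text nodes."""
    return "".join(element.itertext())


def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(text.split())


root = ET.fromstring(HTML)  # root is the <ul> element

# The node set is selected once; the string function is then applied per node
# in the host language, which is the step XPath 1.0 cannot do by itself.
results = [normalize_space(string_value(li)) for li in root.iterfind("li")]
print(results)
```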

In XPath 2.0, the last location step can be a function, so you can use,

//ul/li/string()

to directly return

This is a link
This is another link.

as requested.

This is more educational than practical if you're stuck with Scrapy, which only supports XPath 1.0, but knowing how string values work in XPath is generally helpful in reasoning about XPath text selections.

kjhughes
  • I did not recommend `string(//ul/li)` because that would return *one* string, instead of however many `<li>` elements were there, wouldn't it? – Tomalak Dec 12 '16 at 15:09
  • Absolutely correct, and OP reported as much in the question -- `string(//ul/li)` only returns the string value of the *first* result of `//ul/li`. Your answer is entirely correct (+1); I just wanted to elaborate on the exact concepts in play. Thanks. – kjhughes Dec 12 '16 at 15:15