Extract the value of nodes that contain only English names

Question

A sample XML file is:

<book category="lovestory">
    <title lang="en">Сумерки</title>
    <author>Stephanie Meyer</author>
    <year>2003</year>
    <price>50.07</price>
</book>

So far I have something like this XPath string:

xpath.compile("/book[/title='[a-zA-z0-9]+']/author");

How can I get the authors of all English books? (I mean books whose titles are written in Latin letters, not Cyrillic. The title above is Russian.)

John le Carré is English. You can't assume that if something contains letters outside the range a-z then it's not English. — Michael Kay, Sep 17 '19 at 22:58

AndiCover · Accepted Answer · 2019-09-22T19:43:55.460

Your XPath is almost correct. Try the following XPath instead:

//book//title[@lang='en']//..//author

Explanation:

The first part, //book//title[@lang='en'], selects the title of every book whose title is marked as English. The second part, //..//author, steps up to the parent book and takes its author.
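For reference, here is a minimal sketch of how that expression could be evaluated from Java with the JDK's built-in javax.xml.xpath API. The class name AuthorsByLang, the <library> wrapper element and the second book are assumptions made up for illustration; they are not part of the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AuthorsByLang {

    // Evaluates the XPath from the answer and returns the text of each matching <author>.
    static List<String> authors(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> result = new ArrayList<>();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//book//title[@lang='en']//..//author",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical two-book document; only the first title is tagged lang="en".
        // The second title is "Сумерки", written with Unicode escapes.
        String xml = "<library>"
                + "<book><title lang=\"en\">Twilight</title><author>Stephanie Meyer</author></book>"
                + "<book><title lang=\"ru\">\u0421\u0443\u043c\u0435\u0440\u043a\u0438</title>"
                + "<author>Hypothetical Author</author></book>"
                + "</library>";
        System.out.println(authors(xml));
    }
}
```

For documents shaped like the sample, where title and author are direct children of book, the shorter //book[title/@lang='en']/author should select the same nodes.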

If you cannot rely on the lang attribute, you can use a regular expression, as you tried in your example. The following XPath uses the matches function, which is available from XPath 2.0 on. The pattern is anchored with ^ and $ so that the whole title has to match, and it uses A-Z rather than A-z:

//book//title[matches(text(), '^[a-zA-Z0-9\s]+$')]//..//author
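Two caveats for Java users. The XPath engine that ships with the JDK (javax.xml.xpath) implements XPath 1.0, so matches is not available there unless you plug in an XPath 2.0 processor such as Saxon. Also, in the sample from the question the Cyrillic title is itself tagged lang="en", so the attribute alone would not filter it out. The sketch below swaps in a different technique: it selects every book with plain XPath 1.0 and does the regex check in Java with java.util.regex. The class name LatinTitleAuthors, the <library> wrapper in the usage and the decision to allow digits and whitespace are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LatinTitleAuthors {

    // Assumption: "English" means ASCII letters, digits and whitespace only.
    // Matcher.matches() requires the whole title to match, so no ^ or $ is needed.
    private static final Pattern LATIN_ONLY = Pattern.compile("[a-zA-Z0-9\\s]+");

    // Returns the author of every book whose title passes the regex check.
    static List<String> authors(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> result = new ArrayList<>();
        try {
            NodeList books = (NodeList) xpath.evaluate(
                    "//book", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < books.getLength(); i++) {
                Node book = books.item(i);
                // Relative paths are evaluated against the current <book> node.
                String title = xpath.evaluate("title", book);
                if (LATIN_ONLY.matcher(title).matches()) {
                    result.add(xpath.evaluate("author", book));
                }
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Both titles are tagged lang="en", but the second one is "Сумерки".
        String xml = "<library>"
                + "<book><title lang=\"en\">New Moon</title><author>Stephanie Meyer</author></book>"
                + "<book><title lang=\"en\">\u0421\u0443\u043c\u0435\u0440\u043a\u0438</title>"
                + "<author>Hypothetical Author</author></book>"
                + "</library>";
        System.out.println(authors(xml));
    }
}
```

As Michael Kay's comment on the question points out, a check like this rejects English text that contains accented letters, so treat the character class as a starting point rather than a definition of "English".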

A single slash / selects nodes that are direct children of the current node.

A double slash // selects matching nodes at any depth below the current node in the XML tree.
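The difference is easy to see by counting matches. The sketch below is only an illustration: the class name SlashDemo and the nested <library>/<shelf> document are made up. It also shows why the /title inside the predicate of the question's expression finds nothing: a path that starts with a slash is resolved from the document root, not from the current book.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SlashDemo {

    // Hypothetical document: one book directly under <library>, one nested in <shelf>.
    static final String XML = "<library>"
            + "<book><title>A</title></book>"
            + "<shelf><book><title>B</title></book></shelf>"
            + "</library>";

    // Returns how many nodes the expression selects in the given document.
    static int count(String expression, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("/library/book", XML)); // direct children only
        System.out.println(count("//book", XML));        // any depth
        System.out.println(count("//book[/title]", XML)); // predicate path starts at the root
        System.out.println(count("//book[title]", XML));  // predicate path relative to each book
    }
}
```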

edited Sep 22 '19 at 19:43

answered Sep 17 '19 at 18:05

AndiCover


Note that `[A-z]` matches more than `[a-zA-Z]` https://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret – The fourth bird Sep 17 '19 at 19:11
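A quick way to see this is with java.util.regex, used here as a stand-in for the XPath regex engine. The range `A-z` runs over every character between `A` and `z` in code point order, which includes `[`, `\`, `]`, `^`, `_` and the backtick. The class name CharRangeDemo and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class CharRangeDemo {

    // A-z spans code points 65..122, so it also covers [ \ ] ^ _ ` between Z and a.
    static final Pattern LOOSE = Pattern.compile("[A-z]+");
    // The intended class: ASCII letters only.
    static final Pattern STRICT = Pattern.compile("[a-zA-Z]+");

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "^_^", "[x]"}) {
            System.out.println(s
                    + " loose=" + LOOSE.matcher(s).matches()
                    + " strict=" + STRICT.matcher(s).matches());
        }
    }
}
```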
There is also an XPath 1.0 solution: `//book[not(translate(title,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',''))]` – Alejandro Sep 17 '19 at 20:57
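Because this variant needs only XPath 1.0, it should run on the JDK's built-in engine without extra libraries. It works by deleting every allowed character from the title with translate; if nothing is left, the title contained only allowed characters. The sketch below appends /author to get the authors and adds digits and the space character to the list so that multi-word titles pass. Those additions, the class name TranslateFilter and the <library> wrapper are assumptions, not part of the comment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TranslateFilter {

    // Characters stripped from the title before checking for leftovers.
    // The comment lists only the 52 ASCII letters; digits and the space are an added assumption.
    static final String ALLOWED =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ";

    // Returns the author of every book whose title consists only of ALLOWED characters.
    static List<String> authors(String xml) {
        // translate(title, ALLOWED, '') deletes every allowed character;
        // not(...) is true when the remaining string is empty.
        String expression = "//book[not(translate(title, '" + ALLOWED + "', ''))]/author";
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> result = new ArrayList<>();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical document; the second title is "Сумерки" in Unicode escapes.
        String xml = "<library>"
                + "<book><title lang=\"en\">New Moon</title><author>Stephanie Meyer</author></book>"
                + "<book><title lang=\"en\">\u0421\u0443\u043c\u0435\u0440\u043a\u0438</title>"
                + "<author>Hypothetical Author</author></book>"
                + "</library>";
        System.out.println(authors(xml));
    }
}
```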
