Since the helpful comment by @kjhughes on my previous question, Regex repeat expression, and the linked post Never parse markup with regex, I have been replacing as many unneeded regular expressions in my application as I can. I had been using them to strip unwanted content instead of writing a complete XPath.

But for the following I am wondering if there is also a way to solve it with XPath:

Data: Name: Herr FirstName LastName

XPath so far: //body//div/div/table/tr/td/div/table/tr[3]/td/div/table/tr/td/p[1]/span/text()

On the result I then use the following regex: (?<=Herr |Frau ).*

This is because I only want the data FirstName LastName. The reason I am asking about a name again is that I am scraping two different mails with different templates, and I want the application to be modular.

At the moment I still do this quite often in the application: I just remove all unwanted text with a regex. For this reason I want to know whether it is also possible with XPath. This way I learn more about scraping with XPath and do not harm any unholy children :)

svenQ
  • You may use regex or other string manipulation methods when you have *plain text*. – Wiktor Stribiżew Mar 16 '18 at 14:28
  • XPath and regex are not mutually exclusive. With XPath 2.0 (or greater), you can use regex expressions with `matches()`, `replace()`, `analyze-string()`, and `tokenize()`. https://www.w3.org/TR/xpath-functions-31/#func-matches – Mads Hansen Mar 16 '18 at 15:02
  • https://stackoverflow.com/questions/1525299/xpath-and-xslt-2-0-for-net – Mads Hansen Mar 16 '18 at 15:31
  • Please make this question independent of your last question. That means you need to provide your sample data here. – Adam Katz Mar 16 '18 at 19:02

2 Answers

Assuming that the text() value of the XPath that you provided is "Name: Herr FirstName LastName":

Here is an example of how you can use regex in an XPath 2.0 expression. matches() selects the text() node only if it contains "Herr " or "Frau " (lookahead and lookbehind assertions are not supported in XPath regex), and replace() then applies a regex with capture groups to that text() node's value to return "FirstName LastName".

//body//div/div/
  table/tr/td/div/
  table/tr[3]/td/div/
  table/tr/td/p[1]/
  span/text()[matches(., "(Herr|Frau) ")]/replace(., '.*(Herr|Frau) (.*)', '$2')
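
Note that `|` binds more loosely than concatenation, so an alternation such as `Herr|Frau` needs parentheses when more pattern follows it. Below is a minimal Python sketch to check this outside of XPath. It assumes the sample text from the question, uses Python's `re` module rather than an XPath 2.0 engine (the replacement syntax is `\2` instead of `$2`), and `extract_name` is a hypothetical helper name.

```python
import re

# Grouped alternation: either title, then a space, then the name (group 2).
GROUPED = r".*(Herr|Frau) (.*)"
# Ungrouped: parsed as ".*Herr" OR "Frau (.*)", which is not what is wanted.
UNGROUPED = r".*Herr|Frau (.*)"


def extract_name(text):
    """Mimic matches() + replace(): return the name, or None without a title."""
    if not re.search(r"(Herr|Frau) ", text):
        return None
    return re.sub(GROUPED, r"\2", text)


herr = extract_name("Name: Herr FirstName LastName")
frau = extract_name("Name: Frau FirstName LastName")
no_title = extract_name("Name: FirstName LastName")

# With the ungrouped pattern only "Name: Herr" is consumed and group 1 is
# empty, so the result keeps a leading space.
broken = re.sub(UNGROUPED, r"\1", "Name: Herr FirstName LastName")

print(repr(herr), repr(frau), repr(no_title), repr(broken))
```
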
Mads Hansen

As Wiktor Stribiżew comments, you needn't avoid using regex on plain text from XML – it's markup which shouldn't be parsed via regex.

Mads Hansen shows how to use regex in XPath 2.0.

Here's a way to extract your targeted text if you only have XPath 1.0:

substring(normalize-space( your XPath here ), 12)
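
The offset 12 works because XPath strings are 1-indexed and "Name: Herr " is 11 characters long, so it relies on the text starting with "Name: " and on "Herr" and "Frau" both having four letters. A rough Python sketch of the same arithmetic follows, assuming that sample text. `normalize_space` and `xpath_substring` are hypothetical stand-ins that only approximate the XPath 1.0 functions, not a real XPath engine.

```python
import re


def normalize_space(text):
    """Approximate XPath normalize-space(): trim, collapse space/tab/CR/LF runs."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def xpath_substring(text, start):
    """Approximate XPath 1.0 substring(text, start) for an integer start >= 1."""
    return text[start - 1:]


# Messy whitespace, as it might come out of a table-based mail template.
raw = "  Name:   Herr FirstName\n LastName "

prefix_length = len("Name: Herr ")  # 11, so the name starts at position 12
name = xpath_substring(normalize_space(raw), prefix_length + 1)

print(repr(name))
```

If the templates ever use a title of a different length, the fixed offset no longer lines up, and a regex-based approach is the safer choice.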

kjhughes