Split text by XPath in C# (HtmlAgilityPack)

Question

I have a HtmlNode with InnerHtml:

<a>SomeText</a>
DividerText:
<br>
TextToSelect1
<br/>
TextToSelect2
<br/>
TextToSelect3
<br>
TextToSelect4

Is it possible to select all the 'TextToSelect' texts using only XPath, without C# Split or Regex?

Something like this: /text()/substring-after('DividerText:')

Or how can I get the InnerHtml without the a tag?

What's the discriminant? Is it the fact that they all start with TextToSelect? Or that they all come after a BR that follows DividerText, etc.? — Simon Mourier, May 15 '13 at 14:37
@SimonMourier They all come after a BR that follows DividerText. But maybe I can simply remove the node and then replace 'DividerText' with an empty string. How can I get the InnerHtml without the tag? — Bogdan Kolodii, May 15 '13 at 14:45
It is not possible using XPath to return a subtree with elements removed. It would be possible, though, to return all text nodes of a subtree which are not inside an `<a>` tag. — Jens Erat, May 15 '13 at 15:01

score 2 · Accepted Answer · answered May 15 '13 at 14:53

You can get all texts that follow a BR after a DividerText like this (in a sample console app):

  HtmlDocument doc = new HtmlDocument();
  doc.Load(MyTestHtm);

  foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[contains(., 'DividerText:')]/following-sibling::br/following-sibling::text()"))
  {
      Console.WriteLine(node.InnerText.Trim());
  }

This prints:

TextToSelect1
TextToSelect2
TextToSelect3
TextToSelect4

The XPath expression first finds, anywhere in the document, a text() node that contains the 'DividerText:' token, then gets all BR elements that are its following siblings, and then gets all text nodes that are following siblings of those BR elements.
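
For trying the expression without a file on disk, here is a minimal sketch that loads the markup from the question through LoadHtml. The class name DividerTextDemo, the method TextsAfterDivider and the wrapping div are assumptions made for illustration and do not come from the question; it also assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class DividerTextDemo
{
    // Markup from the question; the surrounding <div> is an assumed wrapper.
    public const string SampleHtml =
        "<div><a>SomeText</a>\nDividerText:\n<br>\nTextToSelect1\n<br/>\nTextToSelect2\n" +
        "<br/>\nTextToSelect3\n<br>\nTextToSelect4\n</div>";

    public static List<string> TextsAfterDivider(string html, string divider)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The divider is spliced into an XPath string literal,
        // so it must not contain a single quote.
        string xpath = "//text()[contains(., '" + divider + "')]" +
                       "/following-sibling::br/following-sibling::text()";

        var result = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return result; // SelectNodes can return null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            // Trim the line breaks around each text node and skip blank ones.
            string text = node.InnerText.Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }
}
```

Calling DividerTextDemo.TextsAfterDivider(DividerTextDemo.SampleHtml, "DividerText:") should give the same four strings as the loop above. Returning an empty list instead of passing the null through keeps a foreach over the result from throwing when the divider is missing.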

Jens Erat · Answer 2 · 2013-05-15T15:15:02.847

To select all text nodes that follow the divider anywhere in the document:

//text()[contains(., 'DividerText:')]/following::text()

To select only the following sibling text nodes (those on the same level inside a wrapping element):

//text()[contains(., 'DividerText:')]/following-sibling::text()

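If you also need the text that comes directly after the divider inside the same text node, you need XPath 2.0. The following query additionally returns the part after the divider string. The substring-after function itself exists in XPath 1.0, but a function call as a path step, the sequence expression and data() do not: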

//text()[contains(., 'DividerText:')]/(substring-after(., 'DividerText:'), following::text()/data())

If you're able to use XPath 2.0 or newer, you can also join all text nodes with string-join and apply substring-after to the result:

substring-after(string-join(//text(), ''), 'DividerText:')

You could also use //text() to fetch all text nodes and then apply an equivalent of substring-after() in C#; you may have to concatenate the resulting node set first.
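
A minimal sketch of that C# fallback follows. It is useful because HtmlAgilityPack's SelectNodes evaluates XPath 1.0, so the string-join query cannot be run through it. The class and method names are hypothetical. The input is assumed to be the InnerText of each node returned by SelectNodes("//text()"), in document order; it is passed in as plain strings so that the helper has no dependency on the parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; the names are illustrative only.
public static class XPathStringHelpers
{
    // Mirrors XPath substring-after(): the text after the first occurrence
    // of marker, or an empty string when marker does not occur.
    public static string SubstringAfter(string value, string marker)
    {
        int index = value.IndexOf(marker, StringComparison.Ordinal);
        return index < 0 ? string.Empty : value.Substring(index + marker.Length);
    }

    // Concatenates the text node values in the order given and keeps
    // everything after the divider, like
    // substring-after(string-join(//text(), ''), 'DividerText:').
    public static string TextAfterDivider(IEnumerable<string> textNodes, string divider)
    {
        return SubstringAfter(string.Concat(textNodes), divider);
    }
}
```

The result is one string that still contains the line breaks between the original text nodes, so getting the individual entries back would take a Split call, which the question wanted to avoid. If separate values are needed, the following-sibling query is the better fit.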
