
Using HTML Agility Pack, I am trying to select nodes in XHTML using XPATH. I want to select the children I listed below in each p tag, but not the grandchildren:

<strike></strike>
<em></em>
<u></u>
<strong></strong>
<sub></sub>
<sup></sup>

In other words, I'm looking for A and B, but not the second level of either node. Meanwhile, A or B nodes can be found anywhere in the set. Note that A or B can be any of the elements I listed above.

[Image: node tree showing A and B]

If I have the following XHTML:

<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta name="generator" content="HTML Tidy for Windows (vers 25 March 2009), see www.w3.org" />
<title></title>
</head>
<body>
    <p><strike>element 1</strike> and <strike><em>element 2</em></strike></p>
    <p><strike>element 3</strike></p>
    <p><strike>element 4</strike></p>
</body>
</html>

If I can select the children I listed above in each p tag, the query should return the following collection of nodes: strike, strike, strike and strike. That gives me access to the children of each strike.

<strike>element 1</strike> and <strike><em>element 2</em></strike>

In XPATH, the first is index [1] (the first instance of strike) and the second, which was ignored, is index [2] (the second instance of strike). This makes sense because that is what my query is doing. The XPATH then grabs the <em> tag, and so on.

Another way I can explain this is by saying I want //a|//b|//c|//d|//e and not the children. Is this possible?

In the end, this leaves me confused about how to arrive at a solution.

I was looking at MSDN for answers on XPATH.

Please let me know if you need further information and I will provide it.

Vyache

2 Answers

1

You use //.

This will select all matching nodes across the entire document, no matter at which level. If you want to select certain nodes only when they are directly under a p, do //p/strike. This will match a p node anywhere, but then only strike nodes directly under a p.
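The difference between the two forms can be sketched as follows. This is a minimal illustration in Python's standard-library ElementTree rather than HTML Agility Pack, and the markup is a simplified, namespace-free variant of the question's sample with one extra hypothetical paragraph added to show a nested strike. ElementTree needs a leading `.` on its paths, so `.//strike` here plays the role of `//strike`.

```python
import xml.etree.ElementTree as ET

# Simplified variant of the question's markup (namespace and head omitted).
# The last paragraph is a hypothetical addition: its strike sits inside an <em>.
XHTML = """<html><body>
<p><strike>element 1</strike> and <strike><em>element 2</em></strike></p>
<p><strike>element 3</strike></p>
<p><em><strike>nested</strike></em></p>
</body></html>"""

root = ET.fromstring(XHTML)

# './/strike' matches strike elements at any depth, including the nested one.
anywhere = root.findall(".//strike")

# './/p/strike' matches only strike elements that are direct children of a p.
direct = root.findall(".//p/strike")

print(len(anywhere), len(direct))
```

With this input the first query also picks up the strike nested inside the em, while the second does not.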

Frank van Puffelen
  • Sorry, this is not exactly what I was looking for. I believe my examples suggested using `//` and the use of the `p`. – Vyache Dec 12 '12 at 21:05
  • Why not? Is there any way you can reduce your question to a simpler reproduction of your problem/question? – Frank van Puffelen Dec 12 '12 at 21:07
  • Yes, but it will require looking for each type of tag. For example, if I went with `//p/strike`, then I would need to look for `//td/p/strike`, `//li/p/strike` etc. I don't want to support more tags, but yes, I have considered it if there is no other way. I'm worried about not catching other tags that I might miss. – Vyache Dec 12 '12 at 21:10
  • If you want to catch all tags, you can just match `//*`. But then the "only one level deep" logic in your original question makes no sense. How about removing all the things you tried from the question and leaving only: the XML, the best XPath you can come up with and the nodes that it mismatches? – Frank van Puffelen Dec 12 '12 at 21:22
  • Actually, I think I can work with `//p` because it is a root: I can get the children of each one and go from there. This will target tables as well. Thank you for the idea. – Vyache Dec 12 '12 at 21:23
  • However, I'm working with CKEditor and I'm worried that if a user manually enters a table, the ` ` tag won't show up. I wonder if I can force it to have that. The answer is useful, but it's not the solution yet. I'm still looking for a better answer. – Vyache Dec 12 '12 at 21:38
  • I took your suggestion and removed the examples. `//a|//b|//c|//d|//e` and not the children. Is this possible? – Vyache Dec 13 '12 at 19:48
  • You can probably get `element 1 and element 2` by selecting `//p/*`, although that would miss the text nodes. Also have a look here: http://stackoverflow.com/questions/1791108/xpath-expression-to-select-all-xml-child-nodes-except-a-specific-list – Frank van Puffelen Dec 13 '12 at 21:04
0

Using the advice of Frank van Puffelen and a friend at work, I came up with a good solution.

This problem will be solved in 2 steps.

First, I will select all the nodes I need with SelectNodes("//strike|//em|//u|//strong|//sub|//sup").

Second, I will use a for loop to go through all of the nodes I selected, looking at each node's parent. If the parent is one of strike, em, u, strong, sub or sup, I remove that node from the collection and continue.
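The two steps above can be sketched roughly as follows. This is an illustration in Python's standard-library ElementTree, not the HTML Agility Pack code I actually used; the names `INLINE_TAGS`, `parent_of`, `candidates` and `top_level` are made up for the example, and the markup is the question's sample with the namespace and head omitted. ElementTree has no union operator and no parent pointer, so the sketch filters by tag name and builds its own child-to-parent lookup.

```python
import xml.etree.ElementTree as ET

# The tags listed in the question (hypothetical constant name).
INLINE_TAGS = {"strike", "em", "u", "strong", "sub", "sup"}

# The question's sample body, simplified (no namespace, no head).
XHTML = """<html><body>
<p><strike>element 1</strike> and <strike><em>element 2</em></strike></p>
<p><strike>element 3</strike></p>
<p><strike>element 4</strike></p>
</body></html>"""

root = ET.fromstring(XHTML)

# ElementTree elements do not know their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Step 1: collect every element whose tag is in the list, at any depth.
candidates = [el for el in root.iter() if el.tag in INLINE_TAGS]

# Step 2: drop candidates whose direct parent is itself one of the listed tags.
top_level = [el for el in candidates if parent_of[el].tag not in INLINE_TAGS]

print([el.tag for el in top_level])
```

Note that this checks only the direct parent, as described above. If a listed tag could sit deeper inside another listed tag with an unrelated element in between, the check would need to walk up through all ancestors instead.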

Thanks everyone.

Vyache