It's true that the HTML isn't great practice, but it's around and people have to deal with it. I think what will help here is normalize-space(). Try this in a Chrome console on a page that has that HTML:
$x(`//label[contains(normalize-space(), "some text")]`);
I think the above code breaks down into this:
- Find all label nodes
- For each of those nodes
  - Get all the text nodes inside that node and its descendant nodes
  - Combine their text into one string
  - Trim leading and trailing whitespace from that string and collapse each run of whitespace into a single space
  - If the exact string I want ("some text") is in there, add that label to the list of nodes to return
- Return a list of all the nodes that had that text (or an empty list if none matched)
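The steps above can be sketched in plain JavaScript. This is only a rough model of what I think XPath is doing, not how a browser implements it: the el and text helpers, the tiny page tree, and the function names are all made up for illustration.

```javascript
// Toy model of a DOM tree: an element has children, a text node has a value.
// These shapes and helper names are illustrative, not a browser API.
const el = (name, ...children) => ({ name, children });
const text = (value) => ({ value });

// String-value of a node: its own text, or all descendant text joined in document order.
function stringValue(node) {
  if ("value" in node) return node.value;
  return node.children.map(stringValue).join("");
}

// Approximates XPath normalize-space(): collapse whitespace runs to one space, then trim.
// XPath whitespace is space, tab, CR, and LF.
function normalizeSpace(s) {
  return s.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");
}

// Collect every element with the given name, in document order.
function findAll(node, name, found = []) {
  if ("value" in node) return found;
  if (node.name === name) found.push(node);
  node.children.forEach((child) => findAll(child, name, found));
  return found;
}

// Rough equivalent of //label[contains(normalize-space(), "...")]
function labelsContaining(root, needle) {
  return findAll(root, "label").filter((label) =>
    normalizeSpace(stringValue(label)).includes(needle)
  );
}

// Hypothetical page: one label with messy whitespace, one unrelated label.
const page = el("div",
  el("label", text("\n  some   text\n")),
  el("label", text("unrelated"))
);

console.log(labelsContaining(page, "some text").length); // 1
```

The first label matches even though its raw text has a newline and extra spaces, because the whitespace is cleaned up before the contains check.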
To get a bit more complicated, if I want to find the node that is the direct parent of a text node that has "some text" I can do this:
$x(`//label//text()[contains(normalize-space(), "some text")]/..`);
That gets tricky, though. For every text node that contains the given string, it gets the direct parent. It works fine for this because "some text" is all in one text node. In other situations it could trip someone up. For example, look at this HTML:
<label>
<span> text </span>
some other text
</label>
As a human, I can read "text some other text" so it seems like I should be able to find the parent of "text some". If I use the first method to look for a label with "text some" in it, this code will find that label just fine:
$x(`//label[contains(normalize-space(), "text some")]`);
That's because a combination of all the text nodes does have "text some" in it. I can't use the second method, though. For example, using this code on that html only gives me an empty list:
$x(`//label//text()[contains(normalize-space(), "text some")]/..`);
That is because when XPath looks in each individual text node, none of them has "text some" inside it. If I think about it, "text some" doesn't really have one direct parent. It sort of has two direct parents. XPath chooses to give no parents instead of both of them. Seems fair enough.
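Here is a small sketch of that difference, using a hand-built model of the label example rather than a real page. The el and text helpers and the exact whitespace in the text nodes are my assumptions about how that HTML would parse.

```javascript
// Toy model of the <label> example; helper names are illustrative, not a browser API.
const el = (name, ...children) => ({ name, children });
const text = (value) => ({ value });

const label = el("label",
  text("\n"),
  el("span", text(" text ")),
  text("\nsome other text\n")
);

// Approximates XPath normalize-space(): collapse whitespace runs, then trim.
const normalizeSpace = (s) =>
  s.replace(/[ \t\r\n]+/g, " ").replace(/^ | $/g, "");

// All text nodes at or below a node, in document order.
function textNodes(node) {
  return "value" in node ? [node] : node.children.flatMap(textNodes);
}

// First method: test the combined, normalized text of the whole label.
const combined = normalizeSpace(textNodes(label).map((t) => t.value).join(""));

// Second method: test each text node on its own.
const perNode = textNodes(label).filter((t) =>
  normalizeSpace(t.value).includes("text some")
);

console.log(combined);                       // text some other text
console.log(combined.includes("text some")); // true
console.log(perNode.length);                 // 0
```

"text some" only exists once the pieces are joined, so the per-node check has nothing to match and there is no parent to return.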
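I haven't found any great explanations of normalize-space(), but the reason it combines the text of all descendant nodes seems to be this: called with no argument, it works on the string-value of the context node, and the string-value of an element is the concatenation of all its descendant text nodes. As for why text() doesn't work, I found this answer from 2014: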
text() is short for child::text(), and selects the text nodes that are immediate children of the label element.
-- https://stackoverflow.com/a/26823289/14144258
As far as I can tell, that still holds.
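To make the child versus descendant distinction from that quote concrete, here is a minimal sketch on the same kind of hand-built tree. The helper names and the tree itself are illustrative assumptions, not a browser API.

```javascript
// Toy model of the <label> example; el/text and both helpers are illustrative names.
const el = (name, ...children) => ({ name, children });
const text = (value) => ({ value });

const label = el("label",
  text("\n"),
  el("span", text(" text ")),
  text("\nsome other text\n")
);

// Like label/text(), i.e. label/child::text(): only the label's own text nodes.
const childText = (node) => node.children.filter((c) => "value" in c);

// Like label//text(): text nodes at any depth below the label.
const descendantText = (node) =>
  node.children.flatMap((c) => ("value" in c ? [c] : descendantText(c)));

console.log(childText(label).length);      // 2
console.log(descendantText(label).length); // 3
```

The text inside the span only shows up in the descendant version, which is why the examples above use //text() under the label instead of a plain text().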
Also, I know this is an old question, but I ran into the same problem trying to troubleshoot my own code and it took ages to find a solution.