My goal was to retrieve all nodes that contain a specific text.

1- I can retrieve nodes that contain some text with the following query:

[node for node in root.xpath('//*[contains(.,"Carte de chaleur")]') ]


Out[62]: 
[<Element workbook at 0x1818bc76e88>,
 <Element worksheets at 0x1819b886dc8>,
 <Element worksheet at 0x1819c156488>,
 <Element layout-options at 0x1819c1564c8>,
 <Element title at 0x1818e9509c8>,
 <Element formatted-text at 0x1819c156c48>,
 <Element run at 0x1818e955048>,
 <Element worksheet at 0x1819c156a88>,
 <Element layout-options at 0x1819c156fc8>,
 <Element title at 0x1818e9508c8>,
 <Element formatted-text at 0x1819c1565c8>,
 <Element run at 0x1818e955088>]

But when I checked, only 2 of these elements contain the specific text in their own text:

[node for node in root.xpath('//*[contains(.,"Carte de chaleur")]') if node.text and "Carte de chaleur" in node.text]
Out[66]: [<Element run at 0x1818e955048>, <Element run at 0x1818e955088>]

In fact, when I look at the path of one of these run nodes, I can see that the 'workbook', 'worksheets', etc. elements are all its ancestors.

run_node
Out[71]: <Element run at 0x1818e955048>
tree.getpath(run_node)
Out[72]: '/workbook/worksheets/worksheet[3]/layout-options/title/formatted-text/run[1]'

So why does this XPath query return all the ancestors of the nodes I am looking for (which are just the 2 run nodes)?

2- If I want the nodes whose attributes contain a specific text, I run this query:

root.xpath('//@*[contains(.,"bold")]/..')
Out[86]: 
[<Element format at 0x18199f56948>,
 <Element format at 0x18199f56148>]

(This makes sense: I want the element that owns a matching attribute node, so I select the parent of that attribute node.)

Strangely, this query does not produce the same result:

root.xpath('//*[contains(@*,"bold")]') 

Yet to me this last one means "take any descendant element of the root that has any attribute containing the text 'bold'", which is the same as the preceding query.

3- Can I retrieve the nodes whose attributes contain different values, using variables?

For one variable i could do:

root.xpath('//*[@name=$var]', var="[Petal_length]")

But is there a way to do something like:

root.xpath('//*[@name=$var1]//title[@format=$var2]', var1="[Petal_length]", var2="bold")
amous

1 Answer

The string value of a node is the concatenation of all the text nodes contained within it, so if one node contains a particular substring in its string value, then all its ancestors will do so as well.
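
The effect can be reproduced with lxml on a small document. This is only a sketch: the XML below is a hypothetical miniature of the workbook from the question, not the real file.

```python
from lxml import etree

# Hypothetical miniature of the workbook structure from the question.
xml = """
<workbook>
  <worksheets>
    <worksheet>
      <title><run>Carte de chaleur</run></title>
    </worksheet>
    <worksheet>
      <title><run>Autre titre</run></title>
    </worksheet>
  </worksheets>
</workbook>
"""
root = etree.fromstring(xml)

# contains(., ...) tests each element's string value, which includes the
# text of all its descendants, so every ancestor of the matching <run>
# satisfies the predicate too.
matches = root.xpath('//*[contains(., "Carte de chaleur")]')
tags = [el.tag for el in matches]
print(tags)

# string(...) exposes the string value that contains() is testing.
value = root.xpath('string(//worksheet[1])')
print(repr(value))
```

The second worksheet and its descendants are absent from the result because their string values do not contain the substring.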

A question for you is what you would want returned for the input

<para>Carte <i>de</i> chaleur</para>

Would you want the para element returned, or not?

If you're happy for this not to be returned, then you're essentially saying that all the text must be found within a single text node, so you can do

//*[text()[contains(.,"Carte de chaleur")]]
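
As a rough illustration with lxml (the document below is made up for the example), this query keeps only elements that have a single text node child holding the whole phrase, so the mixed-content para is skipped:

```python
from lxml import etree

# Made-up input: <para> splits the phrase across several text nodes,
# while <run> holds it in a single text node.
root = etree.fromstring(
    "<doc>"
    "<para>Carte <i>de</i> chaleur</para>"
    "<title><run>Carte de chaleur</run></title>"
    "</doc>"
)

# text() selects the text node children of each element; the inner
# predicate tests each of those text nodes individually.
query = '//*[text()[contains(., "Carte de chaleur")]]'
single_node_tags = [el.tag for el in root.xpath(query)]
print(single_node_tags)
```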

If you do want the para returned, so your requirement is "find the lowest-level elements containing the text, without including their ancestors", then you might have to do something like

//*[contains(.,"Carte de chaleur") and not(*[contains(.,"Carte de chaleur")])]
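
A small lxml sketch of this second query, again on a made-up document, shows that it keeps the para and the run but drops their ancestors:

```python
from lxml import etree

# Made-up input mixing both cases: a phrase split by inline markup
# and a phrase held in one text node.
root = etree.fromstring(
    "<doc>"
    "<para>Carte <i>de</i> chaleur</para>"
    "<title><run>Carte de chaleur</run></title>"
    "</doc>"
)

# Keep an element if its string value contains the phrase and none of
# its child elements does, i.e. the lowest-level match.
query = (
    '//*[contains(., "Carte de chaleur")'
    ' and not(*[contains(., "Carte de chaleur")])]'
)
lowest_tags = [el.tag for el in root.xpath(query)]
print(lowest_tags)
```

Here doc and title are excluded because one of their children already contains the phrase.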

I'm not even starting to think about efficiency here...

Michael Kay
  • Very interesting! But what is the difference between //*[text()[contains(.,"Carte de chaleur")]] and //*[contains(text(),"Carte de chaleur")]? In fact, I don't understand the meaning of the dot inside the contains function. Also, could you suggest any good documentation for getting used to XPath queries? – amous Nov 06 '19 at 13:45
  • For good documentation, I recommend my book! (XSLT/XPath Programmers Reference from Wiley). The difference between (1) `contains(X, 'abc')` and (2) `X[contains(., 'abc')]` is that when X contains multiple items, (1) fails (in XPath 2), or considers only the first item (in XPath 1), whereas (2) returns true if any of the items contains 'abc' as a substring. – Michael Kay Nov 06 '19 at 17:42
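
The difference described in the last comment can be checked with lxml, which evaluates XPath 1.0, so form (1) looks only at the first item. The same rule would explain why contains(@*,"bold") in the question behaves differently from //@*[contains(.,"bold")]/.. when an element has several attributes. The document below is invented for the sketch.

```python
from lxml import etree

# Invented input: <p> has two text node children, and <format> has two
# attributes with "bold" in the second one.
root = etree.fromstring(
    '<doc>'
    '<p>intro <b>x</b> Carte de chaleur</p>'
    '<format font-size="12" font-weight="bold"/>'
    '</doc>'
)

# Form (1): contains(X, ...) converts the node-set X to a string, which
# in XPath 1.0 means the first node only ("intro " here).
first_only = root.xpath('//p[contains(text(), "Carte de chaleur")]')

# Form (2): X[contains(., ...)] tests every node in X separately.
any_node = root.xpath('//p[text()[contains(., "Carte de chaleur")]]')

# Same two forms applied to attributes. Which attribute counts as
# "first" in form (1) is implementation-dependent.
attr_first_only = root.xpath('//*[contains(@*, "bold")]')
attr_any = root.xpath('//*[@*[contains(., "bold")]]')

print(len(first_only), len(any_node))
print(len(attr_first_only), [el.tag for el in attr_any])
```

Writing the test as a predicate on each node, as in form (2), avoids depending on which item happens to come first.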