python, lxml or etree to get a parent of a node containing some text

Question

How can I get the parent node of a node containing a piece of text?

Moreover, can I use a regexp mechanism as the match criterion for searching/filtering, for example searching with re.compile("th[ei]s? .ne") in the document below?

Say this one:

html = '''<html>
<head><title></title></head>
<body>
<table>
<tr><td>1a</td><td>2a</td><td>3a</td><td>4a</td><td>5a</td><td>6a</td></tr>
<tr><td>1b</td><td>2b</td><td>3b</td><td>4b</td><td>5b</td><td>6b</td></tr>
<tr><td>1c</td><td>2c</td><td>3c</td><td>4c</td><td>5c</td><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>1A</td><td>2A</td><td>3A</td><td>4A</td><td>5A</td><td>6A</td></tr>
<tr><td>1B</td><td>2B</td><td>3B</td><td>4B</td><td>5B</td><td>6B</td></tr>
<tr><td>1C</td><td>2C</td><td>3C</td><td>4C</td><td>5C</td><td>6C</td></tr>
</table>this one
</div></div>
</body>
</html>'''

I would like to have an iterator that returns:

<td>6c this one</td>

and then:

<div>
<table>
<tr><td>1A</td><td>2A</td><td>3A</td><td>4A</td><td>5A</td><td>6A</td></tr>
<tr><td>1B</td><td>2B</td><td>3B</td><td>4B</td><td>5B</td><td>6B</td></tr>
<tr><td>1C</td><td>2C</td><td>3C</td><td>4C</td><td>5C</td><td>6C</td></tr>
</table>this one
</div>

I tried:

import lxml.html
root = lxml.html.document_fromstring(html)
root.xpath("//text()[contains(., 'one')]")

and

import xml.etree.ElementTree as ET
for e in ET.fromstring(html).iter():
    if e.text and 'one' in e.text:
        print("Found string %r, element = %r" % (e.text, e))

But the best I can get is the node containing this one itself, while I am looking for the parent containing this text. Note that div and table are only examples: I really need to go back up to the parent after finding "this one", rather than filtering for an XML element containing this one, because I will not know whether it is a div, a table, or anything else before finding what it contains.

(Note also that it is HTML and not well-formed XML, as I suppose that the second this one should have been wrapped in an XML tag.)

EDIT:

>>> root.xpath("//*[contains(child::*/text(), 'one')]") # why empty parent?
[]
>>> root.xpath("//*[contains(text(), 'one')]") # i expected to have a list with two elements td and div
[<Element td at 0x280b600>]
>>> root.xpath("//*[child::*[contains(text(), 'one')]]") # if parent: expected tr and div, if not parent expected table or div, still missing one
[<Element tr at 0x2821f30>]

BTW, using the last expression from the answer works:

import lxml.html
#[... here add html = """...]
root = lxml.html.document_fromstring(html)
for i, x in enumerate(root.xpath("//text()[contains(., 'one')]/parent::*")):
    # lxml's own serializer; encoding="unicode" returns str rather than bytes
    markup = lxml.html.tostring(x, encoding="unicode")
    print("%s => \n\t%s" % (i, markup.replace("\n", "\n\t")))

produces:

0 => 
    <td>6c this one</td>
1 => 
    <div>
    <table>
    <tr><td>1A</td><td>2A</td><td>3A</td><td>4A</td><td>5A</td><td>6A</td></tr>
    <tr><td>1B</td><td>2B</td><td>3B</td><td>4B</td><td>5B</td><td>6B</td></tr>
    <tr><td>1C</td><td>2C</td><td>3C</td><td>4C</td><td>5C</td><td>6C</td></tr>
    </table>this one
    </div>

dirkk · Accepted Answer · 2013-06-18T15:33:39.897

6

Based on your example output, it seems you want the element which contains the specified text 'one'. Your description, however, says you want the parent of this node.

Based on this assumption you can get the desired nodes using the following XPath:

//*[contains(text(), 'one')]

If you really want the parent of this node, you can do

//*[child::*[contains(text(), 'one')]]

By the way, as you can see, I used a predicate to get the node, so I filtered the XML nodes. In my opinion this is the more logical and readable approach, as it basically says "give me all the nodes which fulfill the given condition" rather than "give me the output of my condition and, from that point on, search for the actually desired output". But you could also do something like the following, which better matches your proposed solution:

//text()[contains(., 'one')]/parent::*
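On the regexp part of the question: XPath 1.0 has no regex function of its own, but lxml exposes the EXSLT regular-expressions extension when the `re` prefix is bound to its namespace. The following is a minimal sketch that combines `re:test()` with the last XPath above. The shortened `html` sample is an assumption made for brevity, not the asker's full document.

```python
import lxml.html

# Shortened stand-in for the document in the question (assumption).
html = """<html><body>
<table>
<tr><td>5c</td><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>6C</td></tr>
</table>this one
</div></div>
</body></html>"""

root = lxml.html.document_fromstring(html)

# Bind the "re" prefix to the EXSLT regular-expressions namespace.
ns = {"re": "http://exslt.org/regular-expressions"}

# Select every text node matching the pattern, then step up to its parent.
parents = root.xpath(
    "//text()[re:test(., 'th[ei]s? .ne')]/parent::*",
    namespaces=ns,
)
tags = [el.tag for el in parents]
print(tags)  # ['td', 'div']
```

`re:test()` takes an optional third argument for flags, for example `'i'` for case-insensitive matching.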

edited Jun 18 '13 at 15:33

answered Jun 18 '13 at 14:17


Yes, I need the element which contains the specified text. I was speaking about the parent because I suppose that the XML parser considers there to be a "not visible" text node containing the text data. Not sure I am right here. BTW, I cannot get the parent, nor a list of the matching nodes (here 2); see my edit. And any clues on how to match on a regexp? – user1340802 Jun 18 '13 at 14:40
Sorry, my second XPath was not correct; I updated it. For the second query you ran, I would also expect two elements, td and div, but I get the same result as you did. If I run this using another XPath/XQuery processor, I get the correct and expected result. This looks like a bug in lxml to me. Regarding regex, please take a look at my answer from a few days ago: http://stackoverflow.com/questions/16934852/lxml-find-div-with-id-post-0-9/16935525#16935525 – dirkk Jun 18 '13 at 15:36
@user1340802: Please specify in what way the output of the 2nd XPath is not correct. What output did you receive, and what did you expect to receive? – LarsH Jun 18 '13 at 16:12
@user1340802 It does not return the correct result, but the XPath itself is correct. This is a bug in lxml which does not properly execute this XPath. I tested this in other query processors and it works just fine, so this is not an issue with the XPath, but with lxml. There is nothing we can do about a bug in a specific program. You should contact the developers and submit a bug report. – dirkk Jun 18 '13 at 17:14
@LarsH See the EDIT part of the question; I have already tested all the expressions and outputs. – user1340802 Jun 19 '13 at 07:46

score 1 · Answer 2 · answered Jun 19 '13 at 17:45

>>> root.xpath("//*[contains(child::*/text(), 'one')]") # why empty parent?
[]

This XPath expression selects every element for which the first grandchild text node contains 'one'. The first argument to contains() is expected to be a string, so XPath takes the first node in the result of child::*/text() and takes its string value. Since no element has a text node containing "one" as its first grandchild, the answer is an empty nodelist.

>>> root.xpath("//*[contains(text(), 'one')]")
# i expected to have a list with two elements td and div
[<Element td at 0x280b600>]

For the same reason, this XPath expression selects all elements whose first text node child contains 'one'. That's why the <td> is selected, but the <div> isn't: the div's child text node containing 'one' is not its first child text node.

>>> root.xpath("//*[child::*[contains(text(), 'one')]]")
# if parent: expected tr and div,
# if not parent expected table or div, still missing one
[<Element tr at 0x2821f30>]

This faces the same limitation as the previous expression.
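To make the first-node rule concrete, here is a small sketch comparing `contains(text(), 'one')` with a variant that moves the test into a predicate on `text()`, so that each text node child is checked on its own. The shortened `html` sample is an assumption. It keeps the newline between `<div>` and `<table>`, since that whitespace is the div's first text node.

```python
import lxml.html

# Shortened stand-in for the document in the question (assumption).
html = """<html><body>
<table>
<tr><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>6C</td></tr>
</table>this one
</div></div>
</body></html>"""

root = lxml.html.document_fromstring(html)

# contains() needs a string, so only the FIRST text node child is examined.
first_only = [el.tag for el in root.xpath("//*[contains(text(), 'one')]")]

# Here contains() runs once per text node child, so later text nodes count too.
any_text = [el.tag for el in root.xpath("//*[text()[contains(., 'one')]]")]

print(first_only)  # ['td']
print(any_text)    # ['td', 'div']
```

The second form is the predicate-style equivalent of `//text()[contains(., 'one')]/parent::*`, except that an element with several matching text nodes is returned only once.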

Have you tried the last solution that @dirkk proposed?

//text()[contains(., 'one')]/parent::*

That should avoid your problem with passing multiple nodes as the first argument to contains().
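For the etree half of the title, the same idea can be sketched with only the standard library, assuming the markup is well-formed enough for `ET.fromstring`. ElementTree stores text that follows a child's end tag in that child's `.tail`, and its elements have no parent pointer. The helper below therefore checks both `.text` and `.tail` and builds its own child-to-parent map. The name `containers_matching` and the shortened sample are hypothetical, introduced only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the document in the question (assumption).
xml_doc = """<body>
<table>
<tr><td>5c</td><td>6c this one</td></tr>
</table>
<div><div>
<table>
<tr><td>6C</td></tr>
</table>this one
</div></div>
</body>"""


def containers_matching(root, pattern):
    """Yield the element that directly contains each matching piece of text."""
    # ElementTree has no getparent(), so map every child to its parent first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter():
        # el.text sits inside el itself.
        if el.text and pattern.search(el.text):
            yield el
        # el.tail follows el's end tag, so it is contained by el's parent.
        if el.tail and pattern.search(el.tail) and el in parent_of:
            yield parent_of[el]


root = ET.fromstring(xml_doc)
pattern = re.compile(r"th[ei]s? .ne")
tags = [el.tag for el in containers_matching(root, pattern)]
print(tags)  # ['td', 'div']
```

An element can be yielded more than once if several of its text pieces match. Collecting results into a list and skipping repeats would deduplicate them if that matters.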
