lxml.etree, element.text doesn't return the entire text from an element

Question

I scraped some HTML via XPath, which I then converted into an etree. Something similar to this:

<td> text1 <a> link </a> text2 </td>

But when I call element.text, I only get text1. (The rest must be there: when I check my query in Firebug, the text of the element is highlighted, both the text before and the text after the embedded anchor element.)

This is one way to do it (code snippet from my little Python scrape processor). I wonder if this is an lxml bug? – user522034, Jan 22 '11 at 20:44

if element.tag == "td":
    children = element.getchildren()
    if len(children) > 0:
        topic = element.text + children[0].tail
    else:
        topic = element.text
    print("\tTopic:\t\t%s" % topic)

– user522034, Jan 22 '11 at 20:45

Teddy · Answer 1 · 2013-12-05T21:30:54.737

18

Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.
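A minimal sketch of both calls on the markup from the question, assuming Python 3 and lxml; the variable names are illustrative only. Two details are easy to miss: tostring() returns bytes unless encoding="unicode" is passed, and with method="text" it also appends the element's tail unless with_tail=False is given.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]

# XPath string-value: all descendant text, concatenated in document order.
full_text = td.xpath("string()")

# The same text via serialization; encoding="unicode" yields str, not bytes.
serialized_text = etree.tostring(td, method="text", encoding="unicode")

# For a child element, tostring() includes the tail (" text2 ") by default.
anchor_with_tail = etree.tostring(a, method="text", encoding="unicode")
anchor_only = etree.tostring(a, method="text", encoding="unicode", with_tail=False)

print(repr(full_text))
print(repr(serialized_text))
print(repr(anchor_with_tail))
print(repr(anchor_only))
```

Both calls on the td element include the text of the nested anchor, so neither returns only the text that sits directly inside td.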

edited Dec 05 '13 at 21:30

answered Jan 23 '11 at 01:56

Teddy


tostring(element, method="text") almost works, but it also returns the text of the embedded anchor element, which I don't want. – user522034 Jan 24 '11 at 07:36
element.text + child.tail works, but I wish element.text worked the way I want it to :) – user522034 Jan 24 '11 at 07:38
element.xpath("string()") returns same result as *.tostring(). I tried xpath("text()") which doesn't return the text of the anchor element, but it returns a list of 2 strings. Thanks for pointing out some stuff though. – user522034 Jan 24 '11 at 07:51

demented hedgehog · Answer 2 · 2018-12-04T02:07:17.913

As a public service to people out there who may be as lazy as I am, here's some code from the answers above that you can run.

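from lxml import etree

def get_text1(node):
    # Direct text only: node.text plus the tail of each child.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # All text, recursively, including the text inside child elements.
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # Leading text plus the serialized markup of each child (tail included).
    # encoding="unicode" makes tostring() return str instead of bytes.
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode")
         for child in node.iterchildren()])


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))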

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2

score 7 · Answer 3 · answered Apr 06 '14 at 08:04


Another thing that seems to be working well to get the text out of an element is "".join(element.itertext())
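A small sketch of what itertext() yields for the markup from the question. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; lxml.etree elements provide an itertext() method as well. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET  # stdlib; lxml.etree elements also have itertext()

td = ET.fromstring("<td> text1 <a> link </a> text2 </td>")

# itertext() walks the element in document order and yields td.text,
# then each descendant's text and tail.
pieces = list(td.itertext())
all_text = "".join(pieces)

print(pieces)
print(repr(all_text))
```

Note that this includes the text of the nested anchor, so it does not fit the case where only the text directly inside td is wanted.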


Jonathan


score 7 · Answer 4 · answered Sep 21 '11 at 13:09


It looks like an lxml bug to me, but it is by design according to the documentation. I solved it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result


Jaap Versteegh


It's not a bug, actually it's the feature that allows you to interpose text among subelements when building an XML element: http://stackoverflow.com/q/38520331/694360 – mmj Jul 22 '16 at 07:45
Thanks for pointing that out. I guess that is useful, but imho it would be a lot clearer if `.text` would just return the full text and some other suitably named property would contain only the part up to the first subelement. How about `node.head`. This also gives a clue that what you'll want next is `child.tail` without having to stackoverflow first. – Jaap Versteegh Jul 27 '16 at 19:36

score 3 · Answer 5 · answered Dec 08 '13 at 00:49

<td> text1 <a> link </a> text2 </td>

Here's how it is (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want the text that is inside child elements, you can collect only their tails. Both .text and .tail may be None, so fall back to an empty string:

text = (td.text or '') + ''.join(el.tail or '' for el in td)
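To see the text/tail model with whitespace preserved, here is a short sketch. It uses the standard library's xml.etree.ElementTree, whose .text and .tail attributes behave the same way as in lxml.etree; direct_text is a hypothetical helper name, not part of either library.

```python
import xml.etree.ElementTree as ET  # stdlib; same .text/.tail model as lxml.etree

td = ET.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]

def direct_text(element):
    """Hypothetical helper: text that belongs to `element` itself,
    skipping anything inside its child elements."""
    return (element.text or "") + "".join(child.tail or "" for child in element)

# The trailing " text2 " is stored on the anchor as its tail, not on td.
print(repr(td.text), repr(a.text), repr(a.tail))
print(repr(direct_text(td)))

# With no surrounding text, .text and .tail are None; the helper still works.
empty_case = direct_text(ET.fromstring("<td><a>link</a></td>"))
print(repr(empty_case))
```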

score 1 · Answer 6 · answered Jan 26 '12 at 03:26


def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')


dmzkrsk


score 0 · Answer 7 · answered May 23 '17 at 18:51

If element is the <td> element, you can do the following:

element.xpath('.//text()')

It returns a list of all text nodes at or below the element. The dot refers to the element itself, // descends through all of its descendants, and text() is the node test that selects text nodes. Like string(), this includes the text inside child elements.
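A brief sketch contrasting .//text() with plain text(), assuming lxml is installed; the variable names are illustrative.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# All text nodes at any depth below td, in document order.
all_text_nodes = td.xpath(".//text()")

# Only the text nodes that are direct children of td.
direct_text_nodes = td.xpath("text()")

print(all_text_nodes)
print(direct_text_nodes)
print(repr("".join(direct_text_nodes)))
```

Joining the result of text() gives the direct text without the anchor's content, which is what the question asks for.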

score 0 · Answer 8 · answered Jul 24 '17 at 03:59


element.xpath('normalize-space()') also works.
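Called without an argument, the XPath function normalize-space() takes the string value of the context node (all descendant text), strips leading and trailing whitespace, and collapses each run of whitespace into a single space. A minimal sketch, assuming lxml is installed; the second expression is only an illustration of how to get similar cleanup for the direct text, not part of the original answer.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# Whitespace-normalized string value of td; the anchor's text is included.
normalized = td.xpath("normalize-space()")

# Illustration only: similar whitespace cleanup applied to the direct text nodes.
direct_normalized = " ".join("".join(td.xpath("text()")).split())

print(repr(normalized))
print(repr(direct_normalized))
```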


softwarevamp


Only pasting code is not enough. You should also explain why it works :) – Robert Williams Jul 24 '17 at 04:24
