19

I scraped some HTML via XPath and then converted it into an etree. Something similar to this:

<td> text1 <a> link </a> text2 </td>

but when I call element.text, I only get text1. It must be there: when I check my query in Firebug, the text of the element is highlighted, both the text before and after the embedded anchor element...

Kijewski
user522034

8 Answers

18

Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.

Teddy
  • toString(element, method="text") almost works, but it also returns the text of the embedded anchor element, which I don't want. – user522034 Jan 24 '11 at 07:36
  • element.text + child.tail works, but I wish element.text worked the way I want it to :) – user522034 Jan 24 '11 at 07:38
  • element.xpath("string()") returns same result as *.tostring(). I tried xpath("text()") which doesn't return the text of the anchor element, but it returns a list of 2 strings. Thanks for pointing out some stuff though. – user522034 Jan 24 '11 at 07:51
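For the case raised in these comments, where only the text outside the anchor is wanted as a single string, one option is to join the list that text() returns. This is a minimal sketch, assuming the same sample markup as in the question; the variable names are illustrative only.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# text() selects only the text nodes that are direct children of td,
# so the anchor's own text is left out.
pieces = td.xpath("text()")

# Join the pieces into one string, whitespace preserved.
joined = "".join(pieces)

# Optionally collapse runs of whitespace into single spaces.
collapsed = " ".join(joined.split())
```

Here joined keeps the original spacing, while collapsed is the trimmed variant.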
10

As a public service to people out there who may be as lazy as I am, here's some code from the answers above that you can run.

from lxml import etree

def get_text1(node):
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # encoding="unicode" makes tostring() return str instead of bytes
    return (node.text or "") + "".join(
        etree.tostring(child, encoding="unicode") for child in node.iterchildren())


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2 
demented hedgehog
7

Another thing that seems to be working well to get the text out of an element is "".join(element.itertext())
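As a rough illustration of what itertext() yields, here is a small sketch using the sample markup from the question. Note that it walks the whole subtree, so the anchor's text is included; the variable names are made up for the example.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# itertext() yields every text and tail string in document order,
# including text that belongs to child elements.
chunks = list(td.itertext())

# Concatenate as-is, whitespace preserved.
all_text = "".join(chunks)

# Or strip each chunk and join with single spaces.
tidy = " ".join(chunk.strip() for chunk in chunks)
```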

Jonathan
7

This looks like an lxml bug to me, but according to the documentation it is by design. I've solved it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result
Jaap Versteegh
  • It's not a bug, actually it's the feature that allows you to interpose text among subelements when building an XML element: http://stackoverflow.com/q/38520331/694360 – mmj Jul 22 '16 at 07:45
  • Thanks for pointing that out. I guess that is useful, but imho it would be a lot clearer if `.text` would just return the full text and some other suitably named property would contain only the part up to the first subelement. How about `node.head`. This also gives a clue that what you'll want next is `child.tail` without having to stackoverflow first. – Jaap Versteegh Jul 27 '16 at 19:36
3
<td> text1 <a> link </a> text2 </td>

Here's how it is (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want a text that is inside child elements then you could collect only their tails:

text = (td.text or '') + ''.join(el.tail or '' for el in td)
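To see how the text/tail split plays out with more than one child, here is a small sketch. The markup is a made-up example (not from the question) with no extra whitespace, so the values are easy to read.

```python
from lxml import etree

# Hypothetical sample: text before, between and after two child elements.
td = etree.fromstring("<td>one<a>two</a>three<b>four</b>five</td>")

# For each element, record (tag, text, tail).
# .text is the text before the first child (or all text if no children);
# .tail is the text that follows the element's closing tag.
layout = [(td.tag, td.text, td.tail)]
layout += [(child.tag, child.text, child.tail) for child in td]
```

Each piece of text after a closing tag is stored on the element that was just closed, which is why 'three' and 'five' show up as tails of a and b rather than as part of td.text.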
jfs
1
def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')
dmzkrsk
0

If element is the <td> node, you can do the following:

element.xpath('.//text()')

It will give you a list of all text nodes at or below the current element: the dot is the context node (self), // descends through all of its descendants, and text() selects the text nodes.
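A short sketch of the difference between text() and .//text(), assuming the sample markup from the question. In lxml the returned strings also carry an is_tail flag, which the last line uses to tell tail text apart from element text.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# Only text nodes that are direct children of td.
direct = td.xpath("text()")

# Text nodes of td and of all its descendants, in document order.
deep = td.xpath(".//text()")

# lxml's string results know whether they came from .text or .tail.
tail_flags = [t.is_tail for t in deep]
```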

Jonathan
0
element.xpath('normalize-space()') also works.
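A quick sketch of what normalize-space() returns compared with string(), again assuming the sample markup from the question. Both include the text of the anchor; normalize-space() additionally trims the ends and collapses internal whitespace runs.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# Full string value of the element, whitespace untouched.
raw = td.xpath("string()")

# Same string value, trimmed and with whitespace runs collapsed.
normalized = td.xpath("normalize-space()")
```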
softwarevamp