104

I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"
Benjamin Loison
Kevin Burke
  • 1
    Thanks - I was trying to write an RSS feed parser and display everything inside the tag, which includes HTML tags from the feed provider. – Kevin Burke Jan 07 '11 at 21:16

15 Answers

102

Just use the node.itertext() method, as in:

 ''.join(node.itertext())
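
As a quick illustration of what this returns, here is a minimal sketch run against the third example from the question (the variable names are illustrative, not from the answer). Note that `itertext()` yields only the text pieces, so the `<div>` markup itself is not part of the result.

```python
from lxml import etree

# Third example from the question; names here are illustrative only.
node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# itertext() walks the subtree in document order and yields each text/tail chunk.
chunks = list(node.itertext())
text = "".join(chunks)
print(chunks)
print(text)
```

If the inner tags must be kept, a serialization-based approach such as `tostring()` is needed instead.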
vvvvv
Arthur Debert
  • 3
    This works great, but strips out any tags that you might want. – Yablargo Jan 15 '14 at 14:50
  • Should the string not have a space in it? Or am I missing something? – Private Apr 23 '15 at 09:56
  • 1
    @Private It depends on your specific needs. For instance, I could have markup in which `con` and `gregate` are separate text chunks, to indicate a prefix in a word. Let's say I want to extract the word without markup. If I use `.join` with a space, then I'd get `"con gregate"`, whereas without a space I get `"congregate"`.
    – Louis Sep 01 '15 at 20:18
  • While the answer above was accepted, this is what I actually wanted. – jason m May 09 '20 at 18:25
91

Does text_content() do what you need?
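
For reference, a minimal sketch of `text_content()`, assuming the markup is parsed with `lxml.html` (the method belongs to HTML elements, not to plain `lxml.etree` elements). The sample markup and variable names are illustrative. Like `itertext()`, it returns text only and drops the inner tags.

```python
import lxml.html
from lxml import etree

markup = "<div>Text outside tag <b>Text inside tag</b></div>"

# Parsed with lxml.html, the element is an HtmlElement and has text_content().
html_node = lxml.html.fragment_fromstring(markup)
html_text = html_node.text_content()

# Parsed with lxml.etree, the element is a plain element without that method.
xml_node = etree.fromstring(markup)
has_method = hasattr(xml_node, "text_content")

print(html_text)
print(has_method)
```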

Jacob Marble
Ed Summers
  • 6
    text_content() removes all markup and the OP wants to keep the markup that is inside the tag. – benselme Oct 29 '13 at 20:31
  • 11
    @benselme When I use `text_content`, it says `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'` – roger Apr 10 '15 at 07:38
  • 8
    @roger `text_content()` is available only if your tree is HTML (i.e. if it was parsed with the methods in `lxml.html`). – Louis Jul 15 '15 at 19:14
  • @EdSummers Thanks a lot! This is useful while parsing a `

    ` tag. I was missing text (like nested links) while using `text()` in XPath, but your method worked for me!.

    – Sam Chats Jul 06 '17 at 09:21
  • 3
    As Louis noted, this works only for trees parsed using `lxml.html`. Arthur Debert's solution with `itertext()` is universal. – SergiyKolesnikov Feb 22 '19 at 13:26
  • To make it clearer, `text_content` is a method of HTML elements (`lxml.html`) only, while `itertext` is a method of every element. – sfy Nov 10 '22 at 18:14
48

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

albertov
  • You probably want this to be recursive, in case of e.g. `Text outside tag Text inside tag`.
    –  Jan 07 '11 at 10:04
  • 2
    @delnan. It is not needed, `tostring` already handles the recursive case. You've made me doubt so I tried it out on real code and updated the answer with an example. Thanks for pointing it out. – albertov Jan 07 '11 at 13:36
  • @delnan. Ok, I think I've got you now... You were referring to the text and tails of children, right? Fixed the answer taking that into account. – albertov Jan 07 '11 at 14:02
  • 5
    Code is broken and produces duplicate content: calling `stringify_children(lxmlhtml.fromstring(...))` on a fragment with the texts A, B and C returns a string in which A, B and C each appear twice.
    – hoju Jan 09 '13 at 23:43
  • 1
    To fix the bug @hoju reported, add `with_tail=False` as a parameter to `tostring()`. So `tostring(c, with_tail=False)`. This will fix the problem with the tail text (`C`). The problem with the prefix text (`A`) seems to be a bug in `tostring()`, which adds an extra tag, so it's not a bug in OP's code.

    – anana Jan 27 '15 at 14:38
  • 1
    Second bug can be fixed by removing `c.text` from the `parts` list. I submitted a new answer with these bugs fixed. – anana Jan 27 '15 at 15:20
  • 5
    You should use `tostring(c, encoding=str)` to run this on Python 3. – Antoine Dusséaux Jan 09 '17 at 22:12
22

A version of albertov's stringify_children that fixes the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail) for child in node)),
            (node.tail,)) if chunk)
Peter Varo
anana
22

The following snippet, which uses Python generators, is efficient and returns the text with leading and trailing whitespace removed:

''.join(node.itertext()).strip()

Sandeep
  • 1
    If the node is acquired from an indented text, depending on the parser it will usually contain the indentation whitespace, which itertext() will interleave with the normal text snippets. Depending on the actual setup, the following may be useful: `' '.join(node.itertext('span', 'b'))` - only use the text from `<span>` and `<b>` tags, discarding the tags with "\n " from the indentation. – Zoltan K. Apr 08 '18 at 10:28
6

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

Percival Ulysses
  • 1
    `node.text if node.text is not None else ''` can be just `node.text or ''` – yprez Mar 11 '16 at 19:41
  • Playing Lazarus a bit here (resurrection joke... not punny), but I've seen this post a number of times when I couldn't remember exactly what I did. Given that node.text only returns the text not seen as part of the iterator (when iterating directly over a node, same as node.getchildren() I believe), it seems the solution could easily be simplified down from this to: `''.join([node.text or ''] + [etree.tostring(e) for e in node])` – Tim Alexander Jul 03 '17 at 18:11
  • This one actually works with python 3, whereas the most upvoted answer does not. – Andrey Feb 19 '20 at 06:01
6

One of the simplest code snippets that actually worked for me, as per the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

where html is the node whose complete text you are trying to read. Be aware that it doesn't get rid of the contents of script and style tags, though.
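
A minimal sketch of this call, assuming a plain `lxml.etree` element. Passing `encoding="unicode"` is an addition here, not part of the answer, so that the result is a `str` instead of `bytes` on Python 3. As with `itertext()`, the inner tags are dropped.

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# method="text" serializes only the text nodes of the subtree.
as_bytes = etree.tostring(node, method="text")
as_str = etree.tostring(node, method="text", encoding="unicode")
print(as_bytes)
print(as_str)
```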

Deepan Prabhu Babu
5
import urllib2
from lxml import etree
url = 'some_url'

Getting the URL:

test = urllib2.urlopen(url)
page = test.read()

Getting all the HTML code, including the table tag:

tree = etree.HTML(page)

XPath selector (xpath() returns a list, so take the first match):

table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)

res is the HTML code of the table; this did the job for me.

So you can extract a tag's text content with xpath_text(), and a tag including its content with tostring():

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)
text = tree.xpath_text("//content")

or text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')

This last line, which uses the strip method, is not nice, but it just works.

d3day
  • For me, this works well enough and is admittedly much simpler. I know that I have a
    tag -- every time -- and I can strip it out
    – Yablargo Jan 15 '14 at 15:10
  • 1
    Has `xpath_text` already been removed from lxml? It says `AttributeError: 'lxml.etree._Element' object has no attribute 'xpath_text'` – roger Apr 10 '15 at 07:40
3

Just a quick enhancement as the answer has been given. If you want to clean the inside text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
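
One caveat with indented markup: whitespace-only chunks become empty strings after `strip()`, and joining them with a space leaves doubled spaces. Below is a hedged variant that skips empty chunks. The helper name `clean_text` and the sample markup are hypothetical, chosen only for illustration.

```python
from lxml import etree

def clean_text(node):
    # Strip each text chunk and drop the ones that were whitespace only.
    parts = (chunk.strip() for chunk in node.itertext())
    return " ".join(part for part in parts if part)

node = etree.fromstring(
    "<content>\n  <div>A</div>\n  <div>B</div>\n</content>"
)

# The one-liner from the answer, for comparison.
original = " ".join([n.strip() for n in node.itertext()]).strip()
cleaned = clean_text(node)
print(repr(original))
print(repr(cleaned))
```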
inverted_index
2

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

1

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    # Note: this clears the node's attributes in place, so the opening tag has a known length
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    # encoding='unicode' returns str on Python 3; with_tail=False keeps the slice aligned
    return lxml.html.tostring(node, encoding='unicode', with_tail=False)[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.

Joshmaker
1

Here is a working solution. We can get content with a parent tag and then cut the parent tag from output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$' 
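    # tostring() returns ASCII bytes by default; decode so the str regexes below work on Python 3
    content_with_parent = etree.tostring(parent_element).decode('ascii')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            # chr() replaces Python 2's unichr()
            return chr(int(m.group(1)))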

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove 'white' characters on margins

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')

    return content_without_parent

parent_element must be an Element.

Please note that if you want the text content (rather than HTML entities in the text), leave the html_entities parameter as False.

sergzach
0

lxml has a method for that:

node.text_content()
Hrabal
  • 4
    This answer does not add anything new. The same as https://stackoverflow.com/a/11963661/407651. – mzjn Oct 08 '17 at 12:51
  • The lxml documentation also appears to be wrong: `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'` – mike rodent Aug 28 '21 at 13:41
  • @mikerodent https://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml/11963661#comment50847722_11963661 – Luka Banfi Jul 27 '22 at 08:11
  • @LukaBanfi Thanks. Also that's clever to have found out how to link to a comment: are you a wizard? – mike rodent Jul 27 '22 at 08:17
-2

If this is an `a` tag, you can try:

node.values()
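
For context, a small sketch of what `values()` returns: it lists the element's attribute values (for an `a` tag, typically the `href`), not the text inside the tag. The sample link is an assumption made for illustration.

```python
from lxml import etree

link = etree.fromstring(
    '<a href="https://example.com/" title="Example">Link text</a>'
)

attribute_values = link.values()  # attribute values, in document order
link_text = link.text             # the text inside the tag
print(attribute_values, link_text)
```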
René Höhle
David
-2
import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

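# encoding="unicode" makes tostring() return str, so the str pattern works on Python 3
print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding="unicode"), re.DOTALL).group(1))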
kazufusa