Get all text inside a tag in lxml

Question

I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Thanks - I was trying to write an RSS feed parser and display everything inside the tag, which includes HTML tags from the feed provider. — Kevin Burke, Jan 07 '11 at 21:16

score 102 · Answer 1 · edited Jun 12 '18 at 21:46

Just use the node.itertext() method, as in:

 ''.join(node.itertext())
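
For illustration, here is a minimal sketch of what this returns for the third case in the question. The inline XML string is an assumption made up for the example. Note that the markup itself is dropped and only the text survives:

```python
from lxml import etree

# Sample input modelled on case 3 of the question (assumed for illustration).
node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# itertext() yields node.text, then the text and tail of each descendant,
# in document order. Tags are not part of the result.
text_only = "".join(node.itertext())
print(text_only)  # Text outside tag Text inside tag
```

So this fits when you want the text alone, not when the inner tags must be kept.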

edited Jun 12 '18 at 21:46

vvvvv

answered Feb 25 '13 at 19:00

Arthur Debert

This works great, but strips out any tags that you might want. – Yablargo Jan 15 '14 at 14:50
Should the string not have a space in it? Or am I missing something? – Private Apr 23 '15 at 09:56
@Private It depends on your specific needs. For instance I could have markup like `<b>con</b>gregate` to indicate a prefix in a word. Let's say I want to extract the word without markup. If I use `.join` with a space, then I'd get `"con gregate"` whereas without a space I get `"congregate"`. – Louis Sep 01 '15 at 20:18
While the answer above was accepted, this is what I actually wanted. – jason m May 09 '20 at 18:25

score 91 · Answer 2 · edited May 09 '14 at 20:49

Does text_content() do what you need?
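
A small sketch of how this behaves, assuming the markup is parsed with `lxml.html` (the sample string is made up for illustration). It also shows that elements parsed with plain `lxml.etree` do not have this method:

```python
from lxml import etree, html

markup = "<div>Text outside tag <span>Text inside tag</span></div>"

# Parsed with lxml.html: the element is an HtmlElement and has text_content().
html_node = html.fragment_fromstring(markup)
html_text = html_node.text_content()
print(html_text)  # Text outside tag Text inside tag

# Parsed with lxml.etree: a plain element, which has no text_content().
xml_node = etree.fromstring(markup)
has_text_content = hasattr(xml_node, "text_content")
print(has_text_content)  # False
```

As with `itertext()`, the result is text only, without the inner tags.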

edited May 09 '14 at 20:49

Jacob Marble

answered Aug 15 '12 at 03:14

Ed Summers

text_content() removes all markup and the OP wants to keep the markup that is inside the tag. – benselme Oct 29 '13 at 20:31
@benselme When I use `text_content`, it says `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'` – roger Apr 10 '15 at 07:38
@roger `text_content()` is available only if your tree is HTML (i.e. if it was parsed with the methods in `lxml.html`). – Louis Jul 15 '15 at 19:14
@EdSummers Thanks a lot! This is useful while parsing a `` tag. I was missing text (like nested links) while using `text()` in XPath, but your method worked for me! – Sam Chats Jul 06 '17 at 09:21
As Louis noted, this works only for trees parsed using `lxml.html`. Arthur Debert's solution with `itertext()` is universal. – SergiyKolesnikov Feb 22 '19 at 13:26
To make it clearer, `text_content` is an HTML Element method, while `itertext` is an Element method. – sfy Nov 10 '22 at 18:14

albertov · Accepted Answer · 2011-01-08T23:24:39.440

48

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

edited Jan 08 '11 at 23:24

answered Jan 07 '11 at 09:35

albertov

You probably want this to be recursive, in case of e.g. `Text outside tag <div>Text inside tag</div>`. – delnan Jan 07 '11 at 10:04
@delnan. It is not needed, `tostring` already handles the recursive case. You've made me doubt so I tried it out on real code and updated the answer with an example. Thanks for pointing it out. – albertov Jan 07 '11 at 13:36
@delnan. Ok, I think I've got you now... You were referring to the text and tails of children, right? Fixed the answer taking that into account. – albertov Jan 07 '11 at 14:02
Code is broken and produces duplicate content: >>> stringify_children(lxmlhtml.fromstring('A<div>B</div>C')) 'A<p>A</p>B<div>B</div>CC' – hoju Jan 09 '13 at 23:43
To fix the bug @hoju reported, add `with_tail=False` as a parameter to `tostring()`. So `tostring(c, with_tail=False)`. This will fix the problem with the tail text (`C`). For fixing the problem with the prefix text (`A`), this seems to be a bug in `tostring()` that adds the `<p>` tag, so it's not a bug in OP's code. – anana Jan 27 '15 at 14:38
Second bug can be fixed by removing `c.text` from the `parts` list. I submitted a new answer with these bugs fixed. – anana Jan 27 '15 at 15:20
Should add `tostring(c, encoding=str)` to be run on Python 3. – Antoine Dusséaux Jan 09 '17 at 22:12

score 22 · Answer 4 · edited Jan 23 '18 at 11:22

A version of albertov's stringify_children that solves the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail) for child in node.getchildren())),
            (node.tail,)) if chunk)

Sandeep · Answer 5 · 2016-06-27T11:46:57.330

22

The following snippet, which uses Python generators, works well and is efficient.

''.join(node.itertext()).strip()

edited Jun 27 '16 at 11:46

answered Jun 27 '16 at 11:08

Sandeep

If the node is acquired from an indented text, depending on the parser it will usually have the indentation text, which itertext() will interweave in the normal text snippets. Depending on the actual setup, the following may be useful: `' '.join(node.itertext('span', 'b'))` - only use the text from `<span>` and `<b>` tags, discarding the tags with "\n " from the indentation. – Zoltan K. Apr 08 '18 at 10:28

Percival Ulysses · Answer 6 · 2014-06-13T19:40:23.330

6

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

edited Jun 13 '14 at 19:40

answered Jun 10 '14 at 22:26

Percival Ulysses

`node.text if node.text is not None else ''` can be just `node.text or ''` – yprez Mar 11 '16 at 19:41
Playing Lazarus a bit here (resurrection joke... not punny), but I've seen this post a number of times when I couldn't remember exactly what I did. Given node.text only returns the text not seen as part of the iterator (when iterating directly into a node, same as node.getChildren() I believe), it seems the solution could easily be simplified down from this to: `''.join([node.text or ''] + [etree.tostring(e) for e in node])` – Tim Alexander Jul 03 '17 at 18:11
This one actually works with python 3, whereas the most upvoted answer does not. – Andrey Feb 19 '20 at 06:01

score 6 · Answer 7 · answered Jul 05 '17 at 06:53

One of the simplest code snippets that actually worked for me, as per the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Beware that it doesn't get rid of the contents of script and style tags, though.
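
A minimal sketch of the call, using a made-up sample document. One detail worth knowing: without an `encoding` argument the result is bytes, and passing `encoding="unicode"` gives a `str`:

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# method="text" serializes only the text content and drops all tags.
as_bytes = etree.tostring(node, method="text")
as_text = etree.tostring(node, method="text", encoding="unicode")

print(as_bytes)  # b'Text outside tag Text inside tag'
print(as_text)   # Text outside tag Text inside tag
```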

Deepan Prabhu Babu

strips the html tags – Dennis Golomazov May 01 '18 at 07:19

d3day · Answer 8 · 2012-08-20T20:11:47.477

5

from urllib.request import urlopen  # urllib2.urlopen on Python 2
from lxml import etree
url = 'some_url'

Getting the URL:

test = urlopen(url)
page = test.read()

Getting all the HTML code, including the table tag:

tree = etree.HTML(page)

XPath selector:

table = tree.xpath("xpath_here")
res = etree.tostring(table[0])  # xpath() returns a list, so take the first match

res is the HTML code of the table. This did the job for me.

So you can extract a tag's content with xpath_text(), and tags including their content with tostring():

div = tree.xpath("//div")
div_res = etree.tostring(div[0])

text = tree.xpath_text("//content")

or text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")
div_3_res = etree.tostring(div_3[0], encoding='unicode').strip('<content>').rstrip('</')

Using strip in this last line is not nice, but it just works.

edited Aug 20 '12 at 20:11

answered Aug 19 '12 at 01:14

d3day

For me, this works well enough and is admittedly much simpler. I know that I have a tag -- every time -- and I can strip it out – Yablargo Jan 15 '14 at 15:10
Has `xpath_text` already been removed from lxml? It says `AttributeError: 'lxml.etree._Element' object has no attribute 'xpath_text'` – roger Apr 10 '15 at 07:40

score 3 · Answer 9 · answered Apr 06 '20 at 02:12

Just a quick enhancement as the answer has been given. If you want to clean the inside text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
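
A small sketch of what this does on indented input (the sample XML is made up for illustration). One thing to watch: whitespace-only chunks turn into empty strings, which can leave a double space in the middle. Filtering them out first is one possible way around that:

```python
from lxml import etree

node = etree.fromstring("""<content>
  Text outside tag
  <div>Text inside tag</div>
  <div>More</div>
</content>""")

# The one-liner from above: strip every chunk, then join with spaces.
clean_string = " ".join([n.strip() for n in node.itertext()]).strip()

# Variant that skips chunks that are empty after stripping.
filtered = " ".join(n.strip() for n in node.itertext() if n.strip())

print(repr(clean_string))  # 'Text outside tag Text inside tag  More'
print(repr(filtered))      # 'Text outside tag Text inside tag More'
```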

inverted_index

score 2 · Answer 10 · answered Apr 30 '13 at 16:18

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

score 1 · Answer 11 · answered Sep 08 '15 at 22:22

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()  # note: this modifies the node in place
    opening_tag = len(node.tag) + 2     # length of "<tag>"
    closing_tag = -(len(node.tag) + 3)  # length of "</tag>"
    # encoding='unicode' returns str rather than bytes on Python 3
    return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.

sergzach · Answer 12 · 2017-08-18T18:09:11.890

Here is a working solution. We can get content with a parent tag and then cut the parent tag from output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    # tostring() returns ASCII bytes by default, with non-ASCII characters
    # written as numeric entities (&#NNN;); decode to get a str on Python 3
    content_with_parent = etree.tostring(parent_element).decode('ascii')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return chr(int(m.group(1)))  # unichr() on Python 2

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove whitespace at the edges

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')

    return content_without_parent

parent_element must have Element type.

Please note that if you want text content (not HTML entities in the text), leave the html_entities parameter as False.

_tostr_with_tags(root_element_whose_inner_content_to_include)? — sergzach, Nov 30 '22 at 00:57

score 0 · Answer 13 · answered Oct 08 '17 at 08:36

lxml has a method for that:

node.text_content()

Hrabal

This answer does not add anything new. The same as https://stackoverflow.com/a/11963661/407651. – mzjn Oct 08 '17 at 12:51
The lxml documentation also appears to be wrong: `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'` – mike rodent Aug 28 '21 at 13:41
@mikerodent https://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml/11963661#comment50847722_11963661 – Luka Banfi Jul 27 '22 at 08:11
@LukaBanfi Thanks. Also that's clever to have found out how to link to a comment: are you a wizard? – mike rodent Jul 27 '22 at 08:17

score -2 · Answer 14 · edited Nov 14 '12 at 16:51

If this is an a tag, you can try:

node.values()

edited Nov 14 '12 at 16:51

René Höhle

answered Nov 14 '12 at 16:30

David

This doesn't get the text inside the tag, it gets the attributes inside the tag. – Timothy P. Jurka Feb 01 '13 at 19:28

score -2 · Answer 15 · answered Jan 08 '15 at 00:59

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding="unicode"), re.DOTALL).group(1))
