lxml doesn't get all text in element if text has <br />?

Question

I am using lxml to parse a web document. I want to get all the text in a <p> element, so I use the following code:

from lxml import etree

page = etree.HTML("<html><p>test1 <br /> test2</p></html>")
print(page.xpath("//p")[0].text)    # this prints only "test1", not "test1 <br /> test2"

The problem is that I want all the text in <p>, which is test1 <br /> test2 in the example, but lxml gives me only test1.

How can I get all text in <p> element?

possible duplicate : [Get all text inside a tag in lxml](http://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml) — har07, Apr 10 '15 at 07:22
@har07 it seems that I should use `text_content()`, but I get `AttributeError: 'lxml.etree._Element' object has no attribute 'html_content'` — roger, Apr 10 '15 at 07:33
okay, since you tried using `text_content()` I assumed you want the text without `<br />`. Check my answer for some possible ways — har07, Apr 10 '15 at 07:55
"*I want to get all text in `<p>` which is `test1 <br /> test2`*". This is not correct. The actual text content is `test1 test2`. The `<br />` element is a child of `<p>`, but it is not text. — mzjn, Apr 10 '15 at 15:10
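
To make the point in the last comment concrete, here is a minimal sketch of how lxml stores the example. The text before `<br />` is the `.text` of `<p>`, and the text after it is the `.tail` of the `<br />` child. It also suggests why `text_content()` failed in the comment above: as far as I know, that method is defined on elements built by `lxml.html`, not on plain `lxml.etree` elements. The variable names are illustrative only.

```python
from lxml import etree
import lxml.html

markup = "<html><p>test1 <br /> test2</p></html>"

# Parsed with lxml.etree: plain _Element objects.
p = etree.HTML(markup).xpath("//p")[0]
br = p[0]  # the <br> element is the only child of <p>

print(repr(p.text))   # 'test1 '  (text before the first child)
print(repr(br.text))  # None      (<br> has no text of its own)
print(repr(br.tail))  # ' test2'  (text that follows <br> inside <p>)
print(hasattr(p, "text_content"))  # False

# Parsed with lxml.html: HtmlElement objects, which provide text_content().
html_p = lxml.html.fromstring(markup).xpath("//p")[0]
print(repr(html_p.text_content()))  # 'test1  test2' (two spaces)
```

So `.text` alone can never return the part after the `<br />`; that part has to be collected from the child's `.tail` or by one of the methods in the answers.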

har07 · Answer 1 · 2015-04-10T07:56:41.330

2

Several other possible ways:

p = page.xpath("//p")[0]
# encoding="unicode" returns a text string rather than bytes
print(etree.tostring(p, method="text", encoding="unicode"))

or using the XPath string() function (note that XPath position indexes start from 1, not 0):

page.xpath("string(//p[1])")
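
A self-contained sketch of the two approaches above, under a few assumptions of mine. Passing `encoding="unicode"` makes `tostring` return a `str` instead of bytes, and `with_tail=False` keeps any text that follows `</p>` out of the result. Note also that `//p[1]` selects every `<p>` that is the first `<p>` child of its parent, whereas `(//p)[1]` is the first `<p>` in the whole document. The second `<p>` and the helper names are hypothetical additions for illustration.

```python
from lxml import etree

page = etree.HTML("<html><p>test1 <br /> test2</p><p>other</p></html>")


def text_via_tostring(element):
    """Serialize only the text of `element`, excluding its tail."""
    return etree.tostring(
        element, method="text", encoding="unicode", with_tail=False
    )


def text_via_xpath(tree):
    """Use XPath string() on the first <p> in document order."""
    return tree.xpath("string((//p)[1])")


first_p = page.xpath("//p")[0]
a = text_via_tostring(first_p)
b = text_via_xpath(page)
# normalize-space() additionally collapses runs of whitespace.
c = page.xpath("normalize-space((//p)[1])")

print(repr(a))  # 'test1  test2'
print(repr(b))  # 'test1  test2'
print(repr(c))  # 'test1 test2'
```

The first two results keep both spaces around the `<br />`; `normalize-space()` is one option if a single space is wanted.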

edited Apr 10 '15 at 07:56

answered Apr 10 '15 at 07:50

har07


score 1 · Answer 2 · answered Apr 10 '15 at 07:48

1

Maybe like this

from lxml import etree

pag = etree.HTML("<html><p>test1 <br /> test2</p></html>")
# get all texts
print(pag.xpath("//p/text()"))

['test1 ', ' test2']

# concatenate
print("".join(pag.xpath("//p/text()")))

test1  test2
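
One caveat, shown as a hedged sketch: `text()` on `<p>` returns only the text nodes that are direct children of `<p>`, so text inside nested elements is skipped. `.//text()` or `itertext()` also collects descendant text. The `<b>` markup below is an example I added; it is not from the question.

```python
from lxml import etree

page = etree.HTML("<html><p>test1 <br /> test2 <b>bold</b> end</p></html>")
p = page.xpath("//p")[0]

# Direct text children of <p> only: the text inside <b> is missing.
direct = p.xpath("text()")

# All descendant text nodes, in document order.
nested = p.xpath(".//text()")

# itertext() walks the same text; split/join collapses extra whitespace.
joined = "".join(p.itertext())
normalized = " ".join(joined.split())

print(direct)      # ['test1 ', ' test2 ', ' end']
print(nested)      # ['test1 ', ' test2 ', 'bold', ' end']
print(normalized)  # test1 test2 bold end
```

For the markup in the question the two XPath expressions give the same result, because `<br />` has no text of its own.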

answered Apr 10 '15 at 07:48

Ikarys
