I am using lxml to parse a web document, and I want to get all the text in a <p> element, so I use the following code:

from lxml import etree

page = etree.HTML("<html><p>test1 <br /> test2</p></html>")
print(page.xpath("//p")[0].text)    # this prints only "test1", not "test1 <br/> test2"

The problem is that I want to get all the text in <p>, which is test1 <br /> test2 in the example, but lxml only gives me test1.

How can I get all the text in the <p> element?

roger
  • Possible duplicate: [Get all text inside a tag in lxml](http://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml) – har07 Apr 10 '15 at 07:22
  • @har07 it seems that I should use `text_content()`, but I get `AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'` – roger Apr 10 '15 at 07:33
  • Okay, since you tried using `text_content()` I assumed you want the text without `<br />`. Check my answer for some possible ways. – har07 Apr 10 '15 at 07:55
  • "*I want to get all text in `<p>` which is `test1 <br /> test2`*". This is not correct. The actual text content is `test1 test2`. The `<br />` element is a child of `<p>`, but it is not text. – mzjn Apr 10 '15 at 15:10
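
To make the point in the comments concrete, here is a minimal sketch of how lxml stores the text around a child element, assuming Python 3 and the same markup as in the question. The variable names (`br`, `all_text`, `doc`, `html_text`) are only illustrative. It also shows why `text_content()` raised an error: as far as I know that method is provided by `lxml.html` elements, not by plain `lxml.etree` ones.

```python
from lxml import etree, html

page = etree.HTML("<html><p>test1 <br /> test2</p></html>")
p = page.xpath("//p")[0]
br = p[0]  # the <br> element is the first child of <p>

print(repr(p.text))   # only the text before the first child
print(repr(br.tail))  # the text after <br> is stored on <br> itself

# itertext() yields every text and tail piece inside <p>, in document order
all_text = "".join(p.itertext())
print(repr(all_text))

# Parsing with lxml.html instead gives elements that have text_content()
doc = html.fromstring("<html><p>test1 <br /> test2</p></html>")
html_text = doc.xpath("//p")[0].text_content()
print(repr(html_text))
```

So `.text` is not truncated; the rest of the text simply lives in the `tail` of the `<br>` child.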

2 Answers

Several other possible ways:

p = page.xpath("//p")[0]
# encoding="unicode" returns a text string instead of bytes on Python 3
print(etree.tostring(p, method="text", encoding="unicode"))

Or use the XPath string() function (note that XPath position indexes start from 1, not 0):

page.xpath("string(//p[1])")
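
One difference between the two approaches that may matter on real pages: as far as I can tell, `etree.tostring()` also serializes the element's tail text by default, while `string()` does not. The sketch below assumes Python 3 and uses a hypothetical document in which the `<p>` is followed by more text inside a `<div>`; that markup is not from the question.

```python
from lxml import etree

# Hypothetical markup: " trailing" follows </p> inside the same parent
page = etree.HTML(
    "<html><body><div><p>test1 <br /> test2</p> trailing</div></body></html>"
)
p = page.xpath("//p")[0]

# Default behaviour: the tail text after </p> is included
with_tail = etree.tostring(p, method="text", encoding="unicode")

# with_tail=False restricts the output to the content of <p>
without_tail = etree.tostring(
    p, method="text", encoding="unicode", with_tail=False
)

# string() only considers the descendants of <p>
via_xpath = page.xpath("string(//p[1])")

print(repr(with_tail))
print(repr(without_tail))
print(repr(via_xpath))
```

For the markup in the question both approaches give the same result, since that `<p>` has no tail text.
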
har07

Maybe like this:

from lxml import etree

pag = etree.HTML("<html><p>test1 <br /> test2</p></html>")
# get all texts
print(pag.xpath("//p/text()"))

['test1 ', ' test2']

# concatenate
print("".join(pag.xpath("//p/text()")))

test1  test2
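
A caveat with this approach, sketched under the assumption of Python 3 and a hypothetical `<b>` element added to the example: `//p/text()` selects only the text nodes that are direct children of `<p>`, so text inside nested inline elements is skipped. Using `//p//text()` picks up descendant text as well.

```python
from lxml import etree

# Hypothetical variation of the example with a nested <b> element
pag = etree.HTML("<html><p>test1 <b>bold</b><br /> test2</p></html>")

# Direct child text nodes only: the content of <b> is missing
direct = pag.xpath("//p/text()")
print(direct)

# All descendant text nodes, in document order
nested = pag.xpath("//p//text()")
print(nested)

joined = "".join(nested)
print(joined)
```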

Ikarys