How to convert XPath Element to plain html text?

Question

I have this page:

<body>
  <div>
    <a id="123">text_url</a>
  </div>    
<body>

I want to get the element matched by '//div/a' as plain HTML text:

<a id="123">text_url</a>

How can I do it?

From an XPath point of view, `//div/a` already points to `text_url`. The rest depends on the XPath host. Which XPath engine are you using? Which programming language and XPath library? — har07, Sep 05 '14 at 11:31
Python, with the lxml and grab libraries. As I understand it, the XPath standard itself doesn't provide a method for this? — Anton Barycheuski, Sep 05 '14 at 11:36
I don't know Python, maybe someone else can point you the right way. Usually the XPath library provides a way to get a node's markup. For example, in .NET I can do something like: `var node = xml.SelectSingleNode("//div/a"); var nodesMarkup = node.OuterHtml;` — har07, Sep 05 '14 at 11:41
See, that isn't a matter of XPath, that's about the library API as far as I know — har07, Sep 05 '14 at 11:43
check this answer: http://stackoverflow.com/a/4624146/821594 — stalk, Sep 05 '14 at 12:12
I'm surprised that no one has mentioned [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) yet. — Robᵩ, Sep 05 '14 at 17:07

score 2 · Accepted Answer · answered Sep 05 '14 at 17:03

If you have already parsed the document using lxml, you can serialize an element with lxml.etree.tostring():

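from lxml import etree

# Well-formed version of the page (note the closing </body>)
xml = '''<body>
  <div>
    <a id="123">text_url</a>
  </div>    
</body>'''

root = etree.fromstring(xml)
for a in root.xpath('//div/a'):
    # with_tail=False omits the whitespace that follows </a>;
    # encoding='unicode' returns str instead of bytes on Python 3
    print(etree.tostring(a, method='html', with_tail=False, encoding='unicode'))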

score 0 · Answer 2 · answered Sep 05 '14 at 11:50

A working solution in Python with the grab module:

from grab import Grab

g = Grab()
g.go('file://page.htm')
print(g.doc.select('//div/a')[0].html())

>><a id="123">text_url</a>

Anton Barycheuski

score 0 · Answer 3 · answered Sep 05 '14 at 12:12

You can use Python's re module with re.findall.

import re
print(re.findall(r".*?(<a.*?<\/a>).*", x, re.DOTALL))

where x is a string holding the page HTML from the question.

Output: ['<a id="123">text_url</a>']

See demo as well.

http://regex101.com/r/lF4lY6/1

vks

Regex is not the proper tool for extracting the HTML of a tag from a complex page. – Anton Barycheuski, Sep 05 '14 at 12:27
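
To illustrate the concern in this comment, here is a small sketch that runs the pattern from the answer against a slightly richer page. The `page` string is a hypothetical example, not taken from the question. Two things go wrong: `<a` also matches the start of `<abbr>`, and the trailing `.*` swallows the rest of the input, so the second link is never reported.

```python
import re

# Pattern copied from the answer above.
PATTERN = r".*?(<a.*?<\/a>).*"

# Hypothetical page: an <abbr> element followed by two links.
page = '<p><abbr title="x">HTML</abbr> <a id="1">one</a> <a id="2">two</a></p>'

matches = re.findall(PATTERN, page, re.DOTALL)

# A single match comes back: it starts at <abbr ...> and ends at the first </a>.
print(matches)
```

A parser-based approach such as lxml avoids both problems because it works on elements rather than on character patterns.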

score 0 · Answer 4 · answered Sep 05 '14 at 15:36

You could use the xml package from Python's standard library.

from xml.etree.ElementTree import parse

doc = parse('page.xml')  # assuming page.xml is on disk
# .text gives the element's text content ("text_url"), not its markup
print(doc.find('div/a[@id="123"]').text)

Note that this only works for strict XML. For example, the closing body tag in your page is incorrect, and this code would fail in that case. HTML on the web is rarely strict XML.
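
Since the question asks for the element's markup rather than its text, the standard library's `xml.etree.ElementTree.tostring()` may be closer to what is needed. A minimal sketch, assuming the page has been corrected to well-formed XML and is held in a string instead of a file; the names `PAGE` and `outer_xml` are hypothetical:

```python
import copy
import xml.etree.ElementTree as ET

# Well-formed version of the page from the question.
PAGE = """<body>
  <div>
    <a id="123">text_url</a>
  </div>
</body>"""


def outer_xml(element):
    """Serialize an element to markup, leaving out the text that follows it."""
    # ElementTree appends the element's tail (here: trailing whitespace),
    # so serialize a copy whose tail has been cleared.
    clone = copy.deepcopy(element)
    clone.tail = None
    return ET.tostring(clone, encoding="unicode")


root = ET.fromstring(PAGE)  # root is the <body> element
link = root.find('div/a[@id="123"]')
markup = outer_xml(link)
print(markup)
```

Working on a copy keeps the parsed tree untouched, which matters if the document is serialized again later.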
