I have a page:
<body>
<div>
<a id="123">text_url</a>
</div>
<body>
And I want to get the element '//div/a' as plain HTML text:
<a id="123">text_url</a>
How can I do it?
If you have already parsed the object using lxml, you can serialize it with lxml.etree.tostring():
from lxml import etree
xml='''<body>
<div>
<a id="123">text_url</a>
</div>
</body>'''
root = etree.fromstring(xml)
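for a in root.xpath('//div/a'):
    # with_tail=False drops the text after </a>; encoding='unicode' returns str instead of bytes
    print(etree.tostring(a, method='html', with_tail=False, encoding='unicode'))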
A working solution in Python with the Grab module:
from grab import Grab
g = Grab()
g.go('file://page.htm')
print(g.doc.select('//div/a')[0].html())
>><a id="123">text_url</a>
You can use Python's re module with re.findall:
import re
print(re.findall(r".*?(<a.*?<\/a>).*", x, re.DOTALL))
where x is a string holding the page source shown in the question.
Output: ['<a id="123">text_url</a>']
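For a self-contained version of the regex approach, here is a minimal sketch. It assumes x holds the page exactly as posted in the question, and it uses a simplified pattern that matches only the anchor element itself; that pattern is an assumption of this sketch, not the expression given above. Regular expressions are fragile on real-world HTML (nested or commented-out tags will confuse them), so treat this as suitable for small, predictable snippets only.

```python
import re

# The page as posted in the question, including the unclosed <body> at the end.
x = """<body>
<div>
<a id="123">text_url</a>
</div>
<body>"""

# Non-greedy match from an opening <a ...> tag to the first </a>.
# re.DOTALL lets "." span newlines in case the element is split across lines.
matches = re.findall(r"<a\b.*?</a>", x, re.DOTALL)
print(matches)
```

Because the pattern does not depend on the document being well formed, the incorrect closing body tag does not affect the result.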
You could use the xml package from Python's standard library.
from xml.etree.ElementTree import parse
doc = parse('page.xml') # assuming page.xml is on disk
# .text returns only the element's text content, not its markup
print(doc.find('div/a[@id="123"]').text)
Note that this only works for strict XML. For example, your closing body tag is incorrect, so this code would fail on the page as posted. HTML on the web is rarely strict XML.
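If the goal is the element's markup rather than its text, the standard library's ElementTree.tostring() may be enough. The following is a minimal sketch, assuming the page is available as a string with the closing body tag corrected; element_markup, PAGE and BROKEN_PAGE are hypothetical names introduced for illustration. It also shows the strict-XML limitation: the page as posted raises a ParseError.

```python
import xml.etree.ElementTree as ET

# The page from the question, with the closing tag corrected to </body>.
PAGE = """<body>
<div>
<a id="123">text_url</a>
</div>
</body>"""


def element_markup(source, path):
    """Return the first element matching path as a markup string, or None."""
    root = ET.fromstring(source)
    node = root.find(path)  # path is relative to the root element (<body>)
    if node is None:
        return None
    # tostring() also emits the element's tail (the whitespace after </a>),
    # so strip it to keep only the element itself.
    return ET.tostring(node, encoding="unicode").strip()


markup = element_markup(PAGE, "div/a")
print(markup)

# The page as originally posted ends with a second <body>, which is not
# well-formed XML, so the strict parser rejects it.
BROKEN_PAGE = PAGE.replace("</body>", "<body>")
try:
    ET.fromstring(BROKEN_PAGE)
    strict_parse_failed = False
except ET.ParseError:
    strict_parse_failed = True
print(strict_parse_failed)
```

For pages that cannot be made well formed, a lenient HTML parser such as the lxml approach in the earlier answer is likely the better fit.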