I have a page:

<body>
  <div>
    <a id="123">text_url</a>
  </div>    
<body>

I want to get the element matched by '//div/a' as plain HTML text:

<a id="123">text_url</a>

How can I do it?

har07
Anton Barycheuski
  • from XPath point of view, `//div/a` already points to `text_url`. The rest depends on the XPath host. What is the XPath engine you're using? programming language and the XPath library maybe? – har07 Sep 05 '14 at 11:31
  • python language, libs - lxml, grab. As I understand, XPath standard doesn't support this common method? – Anton Barycheuski Sep 05 '14 at 11:36
  • I don't know python, maybe someone else can lead you the way. Usually, the XPath library provides a way to get node's markup. For example in .NET I can do something like : `var node = xml.SelectSingleNode("//div/a"); var nodesMarkup = node.OuterHtml;` – har07 Sep 05 '14 at 11:41
  • See, that isn't a matter of XPath, that's about the library API as far as I know – har07 Sep 05 '14 at 11:43
  • check this answer: http://stackoverflow.com/a/4624146/821594 – stalk Sep 05 '14 at 12:12
  • I'm surprised that no one has mentioned [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) yet. – Robᵩ Sep 05 '14 at 17:07

4 Answers

If you have already parsed the object using lxml, you can serialize it with lxml.etree.tostring():

from lxml import etree
xml='''<body>
  <div>
    <a id="123">text_url</a>
  </div>    
</body>'''

root = etree.fromstring(xml)
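for a in root.xpath('//div/a'):
    # encoding='unicode' returns a str instead of bytes; with_tail=False omits the text after the closing tag
    print(etree.tostring(a, method='html', encoding='unicode', with_tail=False))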
Robᵩ

A working solution in Python with the grab module:

from grab import Grab

g = Grab()
g.go('file://page.htm')
print(g.doc.select('//div/a')[0].html())

>><a id="123">text_url</a>
Anton Barycheuski

You can use Python's re module with re.findall:

import re
print(re.findall(r".*?(<a.*?</a>).*", x, re.DOTALL))

where x is a string containing the HTML of the page from the question.

Output: ['<a id="123">text_url</a>']

See demo as well.

http://regex101.com/r/lF4lY6/1
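
Because x is not spelled out in this answer, here is a self-contained sketch of the same regex approach. It assumes x holds the page exactly as posted in the question (including the unclosed body tag); the variable name matches is only illustrative.

```python
import re

# Assumed input: the page from the question, stored as a string.
x = """<body>
  <div>
    <a id="123">text_url</a>
  </div>
<body>"""

# The capture group holds the first <a ...>...</a> span.
# re.DOTALL lets "." match newlines, so the pattern can span lines.
matches = re.findall(r".*?(<a.*?</a>).*", x, re.DOTALL)
print(matches)
```

Note that `<a.*?` also matches any tag whose name merely starts with "a" (for example `<abbr>`), and the trailing `.*` means only the first match is returned, so this is probably best kept to small, predictable inputs.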

vks

You could use the xml library in Python.

from xml.etree.ElementTree import parse

doc = parse('page.xml') # assuming page.xml is on disk
print(doc.find('div/a[@id="123"]').text)  # prints the link text, "text_url"

Note that this would only work for strict XML. For example, your closing body tag is incorrect, so this code would fail on the page as posted. HTML on the web is rarely strict XML.
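
The snippet above prints only the element's text. To get the markup itself with the same standard-library module, a minimal sketch could look like the following. It assumes a well-formed copy of the page (closing body tag corrected) held in a string, and outer_markup is a hypothetical helper name, not part of the library.

```python
import copy
import xml.etree.ElementTree as ET

# Assumed input: the question's page with the closing </body> tag fixed,
# since ElementTree only accepts well-formed XML.
PAGE = """<body>
  <div>
    <a id="123">text_url</a>
  </div>
</body>"""

def outer_markup(element):
    """Serialize an element without the text that follows its closing tag."""
    clone = copy.copy(element)  # shallow copy, so the parsed tree is not modified
    clone.tail = None           # ET.tostring would otherwise append the tail text
    return ET.tostring(clone, encoding="unicode")

root = ET.fromstring(PAGE)
link = root.find('div/a[@id="123"]')
markup = outer_markup(link)
print(markup)
```

Clearing the tail on a copy plays the same role as with_tail=False in lxml: without it, the whitespace between `</a>` and `</div>` would be included in the result.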

Saish