
I have an HTML file:

<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>

I would like to extract the text as:

somestr1anotherstr

but I can't figure out how to do it. I have written a to_sup() function that converts numeric strings to superscript so the closest I get is something like:

for i in doc.xpath('.//p/text()|.//sup/text()'):
    if i.tag == 'sup':
        print to_sup(i),
    else:
        print i,

but ElementStringResult doesn't seem to have a method to get the tag name, so I am a bit lost. Any ideas on how to solve this?

BenMorel
root
  • Well, then omit text() from the query and extract the text directly from the nodes. – Dec 17 '12 at 10:42
  • @user1833746: I tried `for x in doc.xpath("//p|//sup"):print(x.text)`, but this only outputs `somestr1`. – root Dec 17 '12 at 10:55

2 Answers


First solution (concatenates the text with no separator; see also "python [lxml] - cleaning out html tags"):

   import lxml.html
   # html_string holds the HTML source as a string
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())

This one helped me, because it concatenates the text the way I needed (one text node per line):

   from lxml import etree
   # document is the tree parsed in the first snippet
   print("\n".join(etree.XPath("//text()")(document)))
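
Both snippets keep the whitespace around each text node, so on the HTML from the question neither prints exactly `somestr1anotherstr`. A minimal sketch of one way to get there, assuming Python 3 and that stripping every text node is acceptable (`html_string`, `pieces` and `joined` are illustrative names, not part of the answer above):

```python
import lxml.html

# The HTML from the question, inlined so the sketch is self-contained.
html_string = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

document = lxml.html.document_fromstring(html_string)

# //text() yields every text node in document order, including
# whitespace-only ones; strip each node and drop the empty results.
pieces = [t.strip() for t in document.xpath("//text()")]
joined = "".join(p for p in pieces if p)
print(joined)
```

Stripping per node removes the indentation and newlines, but it would also glue together words that were separated only by whitespace between nodes, so it fits this input rather than HTML in general.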
Robert Lujo

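Just don't call text() on the sup nodes in the XPath. The query then returns the sup elements themselves, mixed with the text nodes of p; plain text results have no .text attribute, which is what the AttributeError branch catches.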

for x in doc.xpath("//p/text()|//sup"):
    try:
        print(to_sup(x.text))
    except AttributeError:
        print(x)
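
For reference, a self-contained sketch of the same idea, assuming Python 3. The asker's `to_sup()` is not shown in the question, so the version here is a hypothetical stand-in that maps digits to Unicode superscripts. It tests the result type with `isinstance` instead of catching `AttributeError`, and it also shows a variant that keeps the original `text()` query and relies on lxml's "smart string" results, which carry `getparent()`, `is_text` and `is_tail`.

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(): digits -> superscripts.
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")

def to_sup(text):
    return text.translate(SUPERSCRIPTS)

html_string = "<html><p>somestr\n<sup>1</sup>\nanotherstr\n</p></html>"
doc = lxml.html.document_fromstring(html_string)

# Variant 1: select <sup> elements and the text nodes of <p>.
parts = []
for node in doc.xpath("//p/text()|//sup"):
    if isinstance(node, str):
        parts.append(node.strip())                       # text node of <p>
    else:
        parts.append(to_sup((node.text or "").strip()))  # <sup> element
result = "".join(parts)

# Variant 2: select only text nodes and ask each one where it came from.
# The text after </sup> is the *tail* of <sup>, so checking the parent tag
# alone is not enough; is_text separates element text from tail text.
parts2 = []
for s in doc.xpath("//p/text()|//sup/text()"):
    in_sup = s.is_text and s.getparent().tag == "sup"
    parts2.append(to_sup(s.strip()) if in_sup else s.strip())
result2 = "".join(parts2)

print(result)
print(result2)
```

The second variant is closer to what the question tried: the string results have no `tag`, but their parent element does.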
Fred Foo