I need to process some HTML in Python. My requirement is to find a certain tag and replace it with different characters based on the tag's contents. For example:

<html>
   <Head>
   </HEAD>
   <body>
     <blah>
       <_translate attr="french"> I am no one, 
           and no where </_translate>
     <Blah/>
   </body>
 </html>

Should become

<html>
   <Head>
   </HEAD>
   <body>
     <blah>
       Je suis personne et je suis nulle part
     <Blah/>
   </body>
</html>

I would like to leave the original HTML untouched and only replace the tags labeled 'important-tag' (_translate in the example above). The attributes and the contents of that tag are needed to generate the replacement output.

I had thought about extending the HTMLParser object, but I am having trouble getting the original HTML back out when I want it. I think what I most want is to parse the HTML into tokens, with the original text in each token, so I can produce my desired output, i.e. get something like:

(tag, "<html>")
(data, "\n   ")
(tag, "<Head>")
(data, "\n   ")
(end-tag, "</HEAD>")
etc...
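
For illustration, a token stream like this can be approximated with a small regular-expression lexer from the standard library. This is only a sketch under stated assumptions: the names TAG_RE and tokenize are made up for this example, and the pattern assumes there is no ">" inside attribute values and no comments or script bodies in the input. It is not a full HTML lexer, but joining the token texts gives back the original input unchanged.

```python
import re

# Matches one tag-like chunk: "<", anything that is not ">", then ">".
# Assumption: no ">" inside attribute values, no comments or <script> bodies.
TAG_RE = re.compile(r"<[^>]*>")


def tokenize(html):
    """Return a list of (kind, original_text) pairs covering all of html."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Anything between the previous tag and this one is plain data.
        if match.start() > pos:
            tokens.append(("data", html[pos:match.start()]))
        text = match.group(0)
        kind = "end-tag" if text.startswith("</") else "tag"
        tokens.append((kind, text))
        pos = match.end()
    # Trailing data after the last tag, if any.
    if pos < len(html):
        tokens.append(("data", html[pos:]))
    return tokens


sample = "<html>\n   <Head>\n   </HEAD>\n</html>"
tokens = tokenize(sample)
for kind, text in tokens:
    print((kind, text))
```

Because every character of the input ends up in exactly one token, the tokens for the tag of interest can be swapped out and the remaining tokens written back as they were, original casing and whitespace included.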

Does anyone know of a good Pythonic way to accomplish this? Python 2.7 standard libraries are preferred, but third-party libraries would also be worth considering.

Thanks!

Jon Clements
gbtimmon
  • It looks like that issue was covered here: http://stackoverflow.com/questions/717541/parsing-html-in-python. – Flavio Garcia Nov 11 '13 at 16:43
  • No, it wasn't. I am aware of the DOM and HTMLParser; I need a way to parse and preserve the original input, OR a way to perform lexical analysis on HTML. The existing parsers don't seem to do it in a straightforward way... – gbtimmon Nov 11 '13 at 16:54

1 Answer

You can use lxml for this task (http://lxml.de/tutorial.html) and use XPath to navigate easily through your HTML:

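from lxml.html import fromstring
my_html = "HTML CONTENT"
root = fromstring(my_html)
# Find every <_translate> element, wherever it appears in the tree.
nodes_to_process = root.xpath("//_translate")
for node in nodes_to_process:
    lang = node.attrib["attr"]
    translated = AWESOME_TRANSLATE(node.text_content(), lang)
    # Swap the element for its translation, keeping any text that followed it.
    text = translated + (node.tail or "")
    prev, parent = node.getprevious(), node.getparent()
    if prev is not None:
        prev.tail = (prev.tail or "") + text
    else:
        parent.text = (parent.text or "") + text
    parent.remove(node)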

I'll leave the implementation of the AWESOME_TRANSLATE function up to you ;)
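
If the rest of the document has to stay exactly as written, a standard-library alternative is to substitute only the matching span with re.sub and a callback. This is a rough sketch, not a general HTML solution: it assumes the element is closed with </_translate>, is not nested, and carries a double-quoted attr attribute. The names TRANSLATE_RE, replace_translate_tags and fake_translate are hypothetical, and fake_translate is only a stand-in for a real translation function.

```python
import re

# Assumption: the element is closed with </_translate>, is not nested,
# and has a double-quoted attr="..." attribute.
TRANSLATE_RE = re.compile(
    r'<_translate\s+attr="([^"]*)"\s*>(.*?)</_translate>',
    re.DOTALL,
)


def fake_translate(text, lang):
    # Hypothetical stand-in for a real translator: collapses whitespace
    # and prefixes the text with the requested language.
    return "[%s] %s" % (lang, " ".join(text.split()))


def replace_translate_tags(html, translate):
    """Replace each <_translate> element with translate(contents, lang)."""
    def repl(match):
        lang, contents = match.group(1), match.group(2)
        return translate(contents, lang)
    # Text outside the matches is copied through unchanged.
    return TRANSLATE_RE.sub(repl, html)


source = (
    "<body>\n"
    "  <blah>\n"
    '    <_translate attr="french"> I am no one,\n'
    "        and no where </_translate>\n"
    "  <Blah/>\n"
    "</body>"
)
result = replace_translate_tags(source, fake_translate)
print(result)
```

Only the matched element is rewritten, so tag casing and whitespace elsewhere are preserved. A regular expression will not cope with nested or malformed tags, which is where a real parser such as lxml is the safer choice.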

Ketouem