I am using Python's ElementTree to read and modify some content in my HTML files. When I am done with the changes and call the ElementTree.write function:

1) It adds an extra html: prefix in front of all the tags. How can I avoid that?

2) It also adds & where I have special characters. How can I avoid that?

Thank you, Divya.

Rupesh Yadav
Divya
  • Might this be of some help? http://stackoverflow.com/questions/780334/unescape-python-strings-from-http – Louis Sep 07 '11 at 15:44

1 Answer

You can't. ElementTree works by parsing the XML and storing only an abstract representation. It writes that out by walking the abstract representation, but it doesn't remember things like which characters were escaped as entities, or whether an element was written as <foo/> or <foo></foo> (in HTML: <foo> or <foo></foo>).
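
To make the loss concrete, here is a minimal Python 3 sketch using the standard xml.etree.ElementTree module. The helper name round_trip and the sample strings are made up for illustration. It also shows one narrow exception, offered as an aside rather than a general fix: the html: prefix comes from ElementTree's built-in prefix for the XHTML namespace, and register_namespace can map that namespace to an empty prefix. The empty-element and entity normalization stays either way.

```python
import xml.etree.ElementTree as ET


def round_trip(xml_text):
    """Parse a string and serialize it straight back (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


# The original spelling of the empty element and of the character
# reference is not stored in the tree, so it cannot be written back.
source = '<root><foo></foo><p>a &#38; b</p></root>'
result = round_trip(source)
# '<foo></foo>' comes back as '<foo />' and '&#38;' as '&amp;'

# XHTML input: the namespace is written with the 'html' prefix by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"
xhtml = '<html xmlns="%s"><body/></html>' % XHTML_NS
prefixed = round_trip(xhtml)

# Registering an empty prefix for that namespace drops the 'html:' prefix.
# Note that this changes module-wide state.
ET.register_namespace("", XHTML_NS)
unprefixed = round_trip(xhtml)

print(result)
print(prefixed)
print(unprefixed)
```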

Now, since ElementTree only works with XML (not HTML), I'm guessing you're working with lxml.html -- in this case, it in fact automatically corrects certain forms of erroneous HTML, because otherwise it wouldn't be able to store it correctly.

The right way to handle HTML that you want preserved exactly, apart from your own changes, is to read it as tokens that remember their original representation. I've done this using sgmllib (Python 2 only; it was removed in Python 3), but it is imperfect: e.g. there's a get_starttag_text method for getting the exact text of a start tag, but no corresponding method for end tags. It might be good enough anyway.

For example, to write out HTML with all the bold (<b>) tags removed, one might write a function like this:

import sgmllib  # Python 2 only
from cStringIO import StringIO

class SGMLModifier(sgmllib.SGMLParser):
    def __init__(self, *args, **kwargs):
        sgmllib.SGMLParser.__init__(self, *args, **kwargs)
        self._file = StringIO()

    def getvalue(self):
        return self._file.getvalue()

    def start_b(self, attributes):
        # skip it
        pass

    def end_b(self):
        # skip it
        pass

    def unknown_starttag(self, tag, attributes):
        self._file.write(self.get_starttag_text())

    def unknown_endtag(self, tag):
        # we can't get this verbatim.
        self._file.write('</%s>' % tag)

    def handle_comment(self, comment):
        # the comment text keeps its own whitespace, so add no padding;
        # only whitespace between '--' and '>' is lost.
        self._file.write('<!--%s-->' % comment)

    def handle_data(self, data):
        self._file.write(data)

    def convert_entityref(self, ref):
        return '&' + ref + ';'

def remove_bold(html):
    parser = SGMLModifier()
    parser.feed(html)
    return parser.getvalue()

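This might need a bit more work to avoid mangling the input: for example, numeric character references (such as &#38;) and declarations (such as the doctype) are not passed through unchanged by the class above. Check the sgmllib documentation for details.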
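
Because sgmllib is not available in Python 3, here is a rough Python 3 counterpart of the same idea. It swaps in html.parser.HTMLParser from the standard library, which is a different parser from the one used above. The class name TagStripper and its skip parameter are made up for illustration, and this is a sketch rather than a complete solution: end tags are still rebuilt rather than copied verbatim, and entities written without a trailing semicolon will gain one.

```python
from html.parser import HTMLParser
from io import StringIO


class TagStripper(HTMLParser):
    """Re-emit HTML roughly as read, dropping the tags named in `skip`."""

    def __init__(self, skip=("b",)):
        # convert_charrefs=False routes entities to handle_entityref and
        # handle_charref so they can be written back in their original form.
        super().__init__(convert_charrefs=False)
        self._skip = set(skip)
        self._out = StringIO()

    def getvalue(self):
        return self._out.getvalue()

    def handle_starttag(self, tag, attrs):
        if tag not in self._skip:
            # Exact source text of the start tag, attributes included.
            self._out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; write it once, as it appeared.
        if tag not in self._skip:
            self._out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # No verbatim accessor for end tags, so rebuild it.
        if tag not in self._skip:
            self._out.write("</%s>" % tag)

    def handle_data(self, data):
        self._out.write(data)

    def handle_entityref(self, name):
        self._out.write("&%s;" % name)

    def handle_charref(self, name):
        self._out.write("&#%s;" % name)

    def handle_comment(self, data):
        self._out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._out.write("<!%s>" % decl)

    def handle_pi(self, data):
        self._out.write("<?%s>" % data)


def remove_bold(html):
    parser = TagStripper(skip=("b",))
    parser.feed(html)
    parser.close()
    return parser.getvalue()


sample = '<p class="x">a &amp; <b>bold</b> &#169; text<br/></p>'
cleaned = remove_bold(sample)
print(cleaned)
```

The text inside the skipped tags is kept; only the <b> and </b> markers themselves are dropped.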

Devin Jeanpierre
  • Thank you so much for the reply. Yes, after a lot of study I also found that I can't use ElementTree to complete my work. – Divya Sep 07 '11 at 14:52
  • Can you please explain how you used sgmllib to get the text between tags in an HTML file? Please explain with some code so that I can understand. I am new to this library, so please help me out. – Divya Sep 07 '11 at 14:53
  • Hi, thank you so much for that. Just one more question. I have an HTML file. I want to give it as the input file, parse it, and then write the result back to the same file. How should I do that? Any code example that works with your code above, please. – Divya Sep 07 '11 at 17:54
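
For the file round trip asked about in the last comment, one possible sketch in Python 3 follows. The name rewrite_file and its parameters are hypothetical, and the encoding is assumed to be UTF-8 unless stated otherwise. It reads the whole file, passes the text through any string-to-string function (for instance the remove_bold function from the answer), and writes the result back to the same path.

```python
import os
import tempfile


def rewrite_file(path, transform, encoding="utf-8"):
    """Read `path`, apply `transform` to its text, write the result back.

    `transform` is any callable taking and returning a string.
    """
    # newline="" disables line-ending translation in both directions,
    # so the file's own line endings are left as they were.
    with open(path, "r", encoding=encoding, newline="") as f:
        original = f.read()

    # The file is read completely before it is reopened for writing.
    modified = transform(original)

    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(modified)
    return modified


# Demonstration on a throwaway file, with a trivial stand-in transform.
fd, demo_path = tempfile.mkstemp(suffix=".html")
os.close(fd)
with open(demo_path, "w", encoding="utf-8", newline="") as f:
    f.write("<p>a &amp; b</p>\r\n")

rewrite_file(demo_path, lambda text: text.replace("b</p>", "c</p>"))

with open(demo_path, "r", encoding="utf-8", newline="") as f:
    result = f.read()
os.remove(demo_path)
print(repr(result))
```

If the transform raises an exception, the original file is left untouched, because nothing is written until the transform has returned.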