There are many ways to parse and traverse HTML4 files, in many technologies, but I cannot find a suitable one for saving that DOM back to a file.
I want to be able to load an HTML file into a DOM, change one small thing (e.g. an attribute's value), and save the DOM to a file again. When I diff the source file against the created file, they should be completely identical except for that small change.
This kind of task is no problem at all when working with XML and suitable XML libraries, but with HTML there are several issues: whitespace such as indentation or line breaks gets lost or is inserted, start tags that have no end tag (such as <link...>) come out as self-closing <link.../>, and/or CDATA content (e.g. between <script> and </script>) is wrapped in <![CDATA[ ... ]]>. These things are critical in my case.
How can I load, traverse, manipulate and save HTML without the drawbacks described above, and most importantly without whitespace text nodes being altered?
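To make the requirement concrete, here is a minimal sketch of the check I would run on the output of any candidate library. It only uses Python's standard `difflib`; the function names `changed_chunks` and `is_single_line_change` and the sample markup are made up for illustration, and the "mangled" string is hand-written to mimic the problems described above, not the output of any particular tool.

```python
import difflib


def changed_chunks(original: str, rewritten: str) -> list[tuple[list[str], list[str]]]:
    """Return (old_lines, new_lines) for every differing chunk between two texts."""
    old_lines = original.splitlines(keepends=True)
    new_lines = rewritten.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [
        (old_lines[i1:i2], new_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def is_single_line_change(original: str, rewritten: str) -> bool:
    """True if exactly one line was replaced and everything else is identical."""
    chunks = changed_chunks(original, rewritten)
    return len(chunks) == 1 and len(chunks[0][0]) == 1 and len(chunks[0][1]) == 1


# Hypothetical source document.
original = (
    "<head>\n"
    '  <link rel="stylesheet" href="a.css">\n'
    "  <script>if (a < b) go();</script>\n"
    "</head>\n"
)

# What I want: only the attribute value differs.
wanted = original.replace('href="a.css"', 'href="b.css"')

# What I tend to get: <link.../> and a CDATA wrapper in addition to my change.
mangled = (
    "<head>\n"
    '  <link rel="stylesheet" href="b.css"/>\n'
    "  <script><![CDATA[if (a < b) go();]]></script>\n"
    "</head>\n"
)

print(is_single_line_change(original, wanted))
print(is_single_line_change(original, mangled))
```

A round trip is acceptable for my purposes only if `is_single_line_change` returns `True` for it; the comparison is line-based and keeps line endings, so altered indentation or line breaks also count as a failure.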