How to load and parse HTML without modifying its contents

Question

There are many ways to parse and traverse HTML4 files using many technologies, but I cannot find a suitable one for saving that DOM to a file again.

I want to be able to load an HTML file into a DOM, change one small thing (e.g. an attribute's value), and save the DOM to a file again. When diffing the source file and the created file, I want them to be completely identical except for that small change.

This kind of task is no problem at all when working with XML and suitable XML libraries, but when it comes to HTML there are several issues: whitespace such as indentation or line breaks gets lost or is inserted, start tags of empty elements (such as <link...>) emerge as <link.../>, and/or the content of CDATA sections (e.g. between <script> and </script>) is wrapped in <![CDATA[ ]]>. These things are critical in my case.

How can I load, traverse, manipulate and save HTML without the drawbacks described above, most importantly without whitespace text nodes being altered?

Many sites currently under development (or developed in the past few years) use HTML5. Are you only concerned about HTML5 or do you want to handle HTML4, XHTML, and/or microformats as well? – devstruck, May 05 '15 at 15:25
If the "small change" is really a small change, why not read the file into a string variable and use your language's string replacement functions and/or regular expressions? – tiblu, May 05 '15 at 15:25
@tiblu Depending on the requirements here, something that is trivial with a DOM parser (find attribute X on the 3rd Y element contained within Z) could be an absolute nightmare with standard string/regex functions. – James Thorpe, May 05 '15 at 15:26
I want to handle HTML4 (no XHTML, which would be much easier, and no microformats). The actual task is to check and fix cross references (ids and id refs) within a wide set of quite large HTML documents and, as James Thorpe indicated, I do not think that regex is a suitable tool for the task. I adjusted my question and removed the focus on .NET, as I like the approach of using a JavaScript runtime. – Andre, May 06 '15 at 06:04
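
One middle ground between the two positions in the comments above, sketched here in Python purely as an illustration (it is not something proposed in this thread): let a tolerant tokenizer locate the start tag, then splice the replacement into the original string so that every other byte stays untouched. The sketch uses only the standard library's `html.parser`; the names `StartTagLocator` and `set_attribute` are hypothetical, and it assumes the attribute value is quoted and not entity-escaped in the source.

```python
import re
from html.parser import HTMLParser


class StartTagLocator(HTMLParser):
    """Collects the source span of each start tag that carries attr == value."""

    def __init__(self, source, tag, attr, value):
        super().__init__(convert_charrefs=True)
        # The parser reports tag and attribute names in lower case.
        self.wanted = (tag.lower(), attr.lower(), value)
        # getpos() reports (line, column), so remember where each line starts.
        self.line_starts = [0] + [m.end() for m in re.finditer("\n", source)]
        self.spans = []

    def handle_starttag(self, tag, attrs):
        want_tag, want_attr, want_value = self.wanted
        if tag == want_tag and (want_attr, want_value) in attrs:
            line, col = self.getpos()  # position of the tag's "<"
            start = self.line_starts[line - 1] + col
            self.spans.append((start, start + len(self.get_starttag_text())))


def set_attribute(source, tag, attr, old_value, new_value):
    """Return source with attr="old_value" on matching tags changed to new_value."""
    locator = StartTagLocator(source, tag, attr, old_value)
    locator.feed(source)
    locator.close()
    # Assumption: the value is quoted and written literally in the markup.
    pattern = re.compile(
        r"((?<![\w-])(?i:" + re.escape(attr) + r")\s*=\s*)(['\"])"
        + re.escape(old_value) + r"\2"
    )
    result = source
    # Patch from the end so that earlier offsets stay valid.
    for start, end in reversed(locator.spans):
        patched = pattern.sub(
            lambda m: m.group(1) + m.group(2) + new_value + m.group(2),
            source[start:end],
            count=1,
        )
        result = result[:start] + patched + result[end:]
    return result
```

For example, `set_attribute(text, "a", "href", "#old", "#new")` would rewrite only the matching `href` values on `<a>` tags. Because the parser treats `<script>` contents as raw text, markup-like strings inside scripts are not touched. Read and write the file with `open(path, newline="")` so that line endings are not translated along the way.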

score 0 · Answer 1 · edited May 23 '17 at 12:06

comparison

If you want to get really serious, leave out the GUI and go headless; see the SO example with Phantom.

answered May 05 '15 at 15:52 by fuzzybear · edited by Community

I will give Phantom a shot. I like the idea of using the DOM implementation of a web browser. I hope it provides enough options for preserving whitespace etc. – Andre May 06 '15 at 06:19
Turns out it does not provide options for preserving whitespace etc. – Andre May 06 '15 at 06:55
Shame, I'll see if I can find more. PS: keep posting updates; I'm curious, as I may have to do something similar. – fuzzybear May 07 '15 at 01:32

score 0 · Accepted Answer · answered May 08 '15 at 07:26

I am going with the HTML Agility Pack. Loading and saving do not alter anything other than invalid parts of the markup.
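
A claim like this can be checked for any library by diffing the original text against the saved text and confirming that only the intended line differs. The following is a minimal Python sketch of such a check, not part of the original answer; `changed_blocks` is a hypothetical helper name, and it relies only on the standard library's `difflib`.

```python
import difflib


def changed_blocks(original, saved):
    """Return (old_lines, new_lines) for every region where the two texts differ."""
    # keepends=True keeps the line terminators, so CRLF vs. LF changes show up too.
    a = original.splitlines(keepends=True)
    b = saved.splitlines(keepends=True)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

If the round trip is faithful, the result contains exactly one block holding the single edited line. A serializer that also rewrote `<link ...>` as `<link .../>` or re-indented the file would show up as extra or larger blocks. Read both files with `open(path, newline="")` so that Python does not normalize line endings before the comparison.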

answered May 08 '15 at 07:26 by Andre
