Is it possible to fix HTML that has unescaped < and > characters?

Question

For example, if I have this HTML:

<div>this is a test < text</div>

The < after "test" is an error, and the correct HTML should be:

<div>this is a test &lt; text</div>

But I have a lot of HTML files that, by mistake, were not encoded, and I need to fix this error so I can parse them later. The original source of the data is not available, so the only option is to fix the HTML I have.

The same applies to the > character and to text that has both < and > characters, like "<2000> - <2004>". I would like to hear ideas for algorithms or libraries that can help me. Thanks.

Note: the HTML above is only a sample; the work has to be done on big HTML files.

That isn't actually an error. HTML is allowed to have an unencoded < character if it is followed by a space. – Quentin Dec 20 '09 at 20:29
And what about the rules for the > character? When is it allowed unescaped in HTML? – Karim Dec 20 '09 at 21:19

score 1 · Accepted Answer · edited Dec 20 '09 at 20:26

I'd suggest this:

1) Identify and map the locations of all known tags, like <div> and </a>.

2) Replace < and > with &lt; and &gt; everywhere outside the map you built in step 1.
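
The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tag whitelist `KNOWN_TAGS` and the function name `escape_outside_tags` are hypothetical, and the regex assumes that attribute values never contain a raw < or >.

```python
import re

# Hypothetical whitelist; extend it with every tag that occurs in your files.
KNOWN_TAGS = ("div", "p", "a", "b", "i", "span", "table", "tr", "td")

# An opening or closing known tag, with optional attributes.
# Assumption: attribute values contain no raw < or > characters.
TAG_RE = re.compile(r"</?(?:%s)\b[^<>]*>" % "|".join(KNOWN_TAGS), re.IGNORECASE)


def _escape_angles(text):
    """Escape only the two angle brackets, leaving & and quotes alone."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_outside_tags(source):
    """Escape < and > everywhere except inside whitelisted tags."""
    pieces = []
    last = 0
    for match in TAG_RE.finditer(source):                        # step 1: map tag locations
        pieces.append(_escape_angles(source[last:match.start()]))  # step 2: escape the rest
        pieces.append(match.group(0))                            # keep the tag untouched
        last = match.end()
    pieces.append(_escape_angles(source[last:]))                 # text after the last tag
    return "".join(pieces)


print(escape_outside_tags("<div>this is a test < text</div>"))
```

Text that happens to look like a whitelisted tag (for example a literal "<b>" that was meant as text) cannot be told apart from real markup by this approach, so spot-checking the output is advisable.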

edited Dec 20 '09 at 20:26

tvanfosson

answered Dec 20 '09 at 20:22

Pavel Radzivilovsky

score 1 · Answer 2 · answered Dec 20 '09 at 20:29

1) For all known html tags, replace <> with some other characters like {{{ and }}}. You can use regex more or less like this:

Regex.Replace(source, @"<(/?(?:b|a|i|table|td|all|other|known|html|tags)\b[^>]*)>", "{{{$1}}}");

2) Replace < with &lt; and > with &gt;

3) Replace {{{ with < and }}} with >
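
The three steps above could look roughly like this in Python. It is a sketch, not a definitive implementation: the whitelist `PROTECTED_TAGS` and the function name `fix_with_placeholders` are hypothetical, and it assumes that the placeholder strings {{{ and }}} never occur in the original files. In this sketch the slash of a closing tag is kept inside the captured group so that it survives the round trip.

```python
import re

# Hypothetical whitelist; list every tag that really occurs in your files.
PROTECTED_TAGS = ("div", "p", "a", "b", "i", "table", "td")

# Group 1 captures everything between the brackets, including a leading "/".
PROTECTED_RE = re.compile(
    r"<(/?(?:%s)\b[^<>]*)>" % "|".join(PROTECTED_TAGS),
    re.IGNORECASE,
)


def fix_with_placeholders(source):
    # Step 1: protect known tags by swapping their brackets for placeholders.
    protected = PROTECTED_RE.sub(r"{{{\1}}}", source)
    # Step 2: escape every bracket that is left; these must be stray text.
    escaped = protected.replace("<", "&lt;").replace(">", "&gt;")
    # Step 3: turn the placeholders back into real brackets.
    return escaped.replace("{{{", "<").replace("}}}", ">")


print(fix_with_placeholders("<div>this is a test < text</div>"))
```

If the files may contain {{{ or }}} as ordinary text, a different placeholder (for example one built from control characters) would be needed.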

score 0 · Answer 3 · edited May 23 '17 at 10:34

A "relaxed" HTML parser like the HTML Agility Pack for .NET would be a nice fit. You grab the tree as interpreted by the library and then, in each node's text value, replace < and > with their escaped counterparts (&lt; and &gt;).

See here for an example: Iron python, beautiful soup, win32 app
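
As a rough illustration of the same idea, here is a sketch that uses Python's standard-library `html.parser.HTMLParser` as a stand-in for the HTML Agility Pack (it is not the library named above, and it streams events instead of building a tree). The class name `AngleFixer` and the function `fix_html` are hypothetical. The parser decides what counts as a tag, and only the text it reports is escaped. Two caveats: text that looks like a tag, such as a literal "<foo>", is still read as a tag, and the contents of script or style elements would be escaped too, so this assumes the files contain neither.

```python
from html.parser import HTMLParser


def _escape_angles(text):
    return text.replace("<", "&lt;").replace(">", "&gt;")


class AngleFixer(HTMLParser):
    """Re-emit the document, escaping < and > found in text nodes."""

    def __init__(self):
        # Leave entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag text as written

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(_escape_angles(data))      # text node: escape brackets

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)             # pass &name; through

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)            # pass &#NNN; through

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def fix_html(source):
    parser = AngleFixer()
    parser.feed(source)
    parser.close()                                 # flush any buffered text
    return "".join(parser.out)


print(fix_html("<div>this is a test < text</div>"))
```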

edited May 23 '17 at 10:34

Community

answered Dec 20 '09 at 20:26

Vinko Vrsalovic

I don't think this will be possible, since replacing the < and > in each node will actually replace the child nodes too, and in the end I will have a single body containing one big string of escaped children. – Karim Dec 20 '09 at 21:14
Nope, that won't happen, as the tree is built based on recognized tags, and the actual tags are not modified along with the node values. But feel free to use a more tedious and error-prone approach :) – Vinko Vrsalovic Dec 21 '09 at 06:40

score 0 · Answer 4 · answered Dec 20 '09 at 20:27

A slow way to do it would be to treat each HTML file as an XML file, then walk through each node of that XML file and call Server.HtmlEncode on the contents of the node. Since HTML is just a defined set of XML this should work.

answered Dec 20 '09 at 20:27

Avitus

This won't be possible, since this HTML won't be considered valid XML. Even with tools like the HTML Agility Pack it isn't valid, since they will treat these unescaped < characters as tags. – Karim Dec 20 '09 at 21:13
