0

For example, suppose I have this HTML:

<div>this is a test < text</div>

The < after "test" is an error, and the correct HTML should be:

<div>this is a test &lt; text</div>

But I have a lot of HTML files that, by mistake, were not encoded, and I need to fix this error so I can parse them later. The original source of the data is not available, so the only option is to fix the HTML I have.

The same applies to the > character and to text that contains both < and > characters, like "<2000> - <2004>". I would like to hear ideas for algorithms or libraries that can help me. Thanks.

Note: the HTML above is only a sample; the work has to be done on big HTML files.

Michael Petrotta
Karim

4 Answers

1

I'd suggest this:

1) Identify and map the locations of all known tags, like <div> and </a>.

2) Replace < with &lt; and > with &gt; everywhere outside the map you built in step 1.
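The approach above can be sketched roughly as follows. This is a minimal Python illustration, not the answerer's code: the KNOWN_TAGS whitelist and the function names are hypothetical, and the pattern assumes attribute values never contain < or >. Anything that matches a whitelisted tag is copied through unchanged; everything between matches is escaped.

```python
import re

# Hypothetical whitelist: extend it with every tag your files really use.
KNOWN_TAGS = ("div", "p", "a", "b", "i", "span", "table", "tr", "td", "br")

# A known tag: "<" or "</", a whitelisted name, optional attributes, ">".
# Assumption: attribute values do not contain "<" or ">".
TAG_RE = re.compile(
    r"</?(?:%s)(?:\s[^<>]*)?/?>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)


def escape_text(text):
    """Escape the angle brackets in a run of plain text."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_outside_tags(html):
    """Keep known tags as they are and escape < and > everywhere else."""
    out = []
    pos = 0
    for match in TAG_RE.finditer(html):   # step 1: locate the known tags
        out.append(escape_text(html[pos:match.start()]))  # step 2: fix the gap
        out.append(match.group(0))        # the tag itself is left untouched
        pos = match.end()
    out.append(escape_text(html[pos:]))   # text after the last tag
    return "".join(out)


print(escape_outside_tags("<div>this is a test < text</div>"))
```

Existing entities such as &lt; are left alone because only < and > are replaced. Text that happens to look like a whitelisted tag would still be kept as a tag, so the output is worth spot-checking.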

tvanfosson
Pavel Radzivilovsky
1

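1) For all known HTML tags, replace < and > with some other character sequences, like {{{ and }}}. You can use a regex more or less like this:

// The attribute part is optional and the leading / is captured, so <b>, </b> and <a href="..."> all match.
Regex.Replace(source, @"<(/?(?:b|a|i|table|td|all|other|known|html|tags)(?:\s[^<>]*)?/?)>", "{{{$1}}}", RegexOptions.IgnoreCase);

2) Replace < with &lt; and > with &gt;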

3) Replace {{{ with < and }}} with >
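For comparison, here is the same three-step idea as a small Python sketch. It is an illustration under assumptions, not the answerer's code: the KNOWN_TAGS list and the function name are hypothetical, and the function refuses input that already contains the placeholder strings, because step 3 could not tell them apart from protected tags.

```python
import re

# Hypothetical whitelist: extend it with every tag your files really use.
KNOWN_TAGS = ("div", "p", "a", "b", "i", "table", "tr", "td", "br")

# Group 1 captures everything between the angle brackets of a known tag.
TAG_RE = re.compile(
    r"<(/?(?:%s)(?:\s[^<>]*)?/?)>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE,
)

OPEN, CLOSE = "{{{", "}}}"


def fix_with_placeholders(html):
    """Protect known tags, escape the remaining brackets, restore the tags."""
    if OPEN in html or CLOSE in html:
        raise ValueError("placeholder already occurs in the input; pick another")
    # Step 1: <tag ...> becomes {{{tag ...}}}
    protected = TAG_RE.sub(lambda m: OPEN + m.group(1) + CLOSE, html)
    # Step 2: every bracket that is left was not part of a known tag
    escaped = protected.replace("<", "&lt;").replace(">", "&gt;")
    # Step 3: turn the placeholders back into real brackets
    return escaped.replace(OPEN, "<").replace(CLOSE, ">")


print(fix_with_placeholders("<div><2000> - <2004></div>"))
```

If {{{ or }}} can legitimately appear in your files, choose placeholder strings that cannot.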

yu_sha
0

Using a "relaxed" HTML parser like the HTML Agility Pack for .NET would be a nice fit. You grab the tree as interpreted by the library and then, in each node value, replace < and > with their escaped counterparts (&lt; and &gt;).

See here for an example: Iron python, beautiful soup, win32 app
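As a rough illustration of the parser-based idea, here is a sketch that uses Python's standard html.parser module as a stand-in. It is not the HTML Agility Pack, and other lenient parsers may tokenize stray brackets differently. The class and function names are hypothetical. The sketch assumes the parser reports a < that is not followed by a letter as text, so "< text" and "<2000>" arrive as data, while a stray < followed by a letter would still be read as a tag.

```python
from html.parser import HTMLParser


class EscapingRewriter(HTMLParser):
    """Re-emit the document, escaping < and > that occur in text."""

    RAW_TEXT = {"script", "style"}  # content that must not be escaped

    def __init__(self):
        # Keep existing entities as separate events so they pass through as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in self.RAW_TEXT:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in self.RAW_TEXT:
            self.in_raw = False

    def handle_data(self, data):
        if not self.in_raw:
            data = data.replace("<", "&lt;").replace(">", "&gt;")
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)

    def unknown_decl(self, data):
        self.out.append("<![%s]>" % data)


def escape_text_nodes(html):
    parser = EscapingRewriter()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return "".join(parser.out)


print(escape_text_nodes("<div>this is a test < text</div>"))
```

End tags are rebuilt from the parsed tag name, so their case and spacing may differ from the input. On real files, compare the output with the original before relying on it.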

Vinko Vrsalovic
  • I don't think this will be possible, since replacing the < and > in each node will actually replace the child nodes, and in the end I will have a single body with one big string of escaped children – Karim Dec 20 '09 at 21:14
  • Nope, that won't happen, as the tree is built from the recognized tags, and the actual tags are not modified along with the node values. But feel free to use a more tedious and error-prone approach :) – Vinko Vrsalovic Dec 21 '09 at 06:40
0

A slow way to do it would be to treat each HTML file as an XML file. Then walk through each of the nodes of that XML file and call Server.HtmlEncode on the contents of the node. Since HTML is just a defined set of XML, this should work.

Avitus
  • This won't be possible, since this HTML won't be considered valid XML. Even with tools like the HTML Agility Pack it's not valid, since they will treat the unescaped < as a tag – Karim Dec 20 '09 at 21:13
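The XML objection in the comment above is easy to check. The sketch below uses Python's xml.etree.ElementTree as a stand-in for a strict XML parser (the function name is hypothetical): the unescaped sample from the question is rejected, while the escaped version parses.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed_xml("<div>this is a test < text</div>"))     # unescaped <
print(is_well_formed_xml("<div>this is a test &lt; text</div>"))  # escaped
```

So a strict XML pass only becomes usable after the stray brackets have been fixed by some other means.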