For example, if I have this HTML:
<div>this is a test < text</div>
The < after "test" is an error, and the correct HTML should be:
<div>this is a test &lt; text</div>
But I have a lot of HTML files that were mistakenly left unencoded, and I need to fix this error so I can parse them later. The original source of the data is not available, so the only option is to fix the HTML I have.
The same applies to the > character, and to text that contains both < and > characters, like "<2000> - <2004>". I would like to hear ideas for algorithms or libraries that can help me. Thanks.
Note: the HTML above is only a small sample; the work has to be done on big HTML files.
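
To make the problem concrete, here is a rough sketch of one heuristic that could serve as a starting point. It is not a complete solution. It assumes that a real tag always starts with "<" or "</" followed immediately by a letter and contains no other "<" or ">" before its closing ">"; anything else is treated as text and escaped. The function name escape_stray_brackets and the pattern TAG_RE are made up for this illustration.

```python
import re

# Assumption: a real tag is "<" or "</" followed directly by a letter,
# with no "<" or ">" inside it. Comments and <!DOCTYPE ...> are kept too.
# Everything outside these matches is treated as plain text.
TAG_RE = re.compile(
    r"<!--.*?-->|<![A-Za-z][^<>]*>|</?[A-Za-z][^<>]*>",
    re.DOTALL,
)


def _escape_text(text: str) -> str:
    # Only the angle brackets are escaped; "&" is left alone so that
    # entities already present (e.g. "&amp;") are not double-encoded.
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_stray_brackets(html: str) -> str:
    parts = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Text between the previous tag and this one: escape it.
        parts.append(_escape_text(html[pos:match.start()]))
        # The tag itself is copied through unchanged.
        parts.append(match.group(0))
        pos = match.end()
    # Trailing text after the last tag.
    parts.append(_escape_text(html[pos:]))
    return "".join(parts)


print(escape_stray_brackets("<div>this is a test < text</div>"))
print(escape_stray_brackets("<p><2000> - <2004></p>"))
```

This handles the two cases above, but it would misread text such as "a <b and c> d" as a tag, and it would also escape brackets inside script or style blocks and break attribute values that contain ">". Checking the tag name against a list of known HTML elements might reduce the false positives.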