
I'm parsing some pretty bad HTML. It was going well until I noticed that in some elements the attribute values contain "<".

Example:

<a href="#Anchor-<ht-42368">40</a>

results in

<a href="#Anchor-">
    <ht-42368>40</ht-42368>
</a>

The original renders fine in a browser, but HTMLCleaner treats the "<" inside the attribute as the start of a new tag. It inserts the characters "> (closing the attribute and the <a> tag) before starting that new tag, which I don't want.

What is the best way to fix this? I'm not sure whether HTMLCleaner has any properties I can configure to handle this. If not, how should I preprocess the HTML to fix these characters?

EDIT: fixed example

EDIT: I'm thinking I could apply a replaceAll() with a regex before passing the HTML to HTMLCleaner. Maybe something like ="[^"]*" to match each quoted attribute value, check whether it contains "<", and if it does, replace the "<" with its escaped HTML entity (&lt;). Would that work?
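A rough sketch of that preprocessing idea, using only java.util.regex. It assumes every attribute value is double-quoted and never contains a literal double quote. The class and method names (AttrEscaper, escapeLessThan) are made up for illustration and are not part of HTMLCleaner.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttrEscaper {
    // Matches ="..." : an equals sign followed by a double-quoted value.
    // Assumption: values are always double-quoted and contain no '"'.
    private static final Pattern QUOTED_VALUE = Pattern.compile("=\"([^\"]*)\"");

    // Hypothetical helper: escapes '<' inside quoted values only.
    public static String escapeLessThan(String html) {
        Matcher m = QUOTED_VALUE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(1).replace("<", "&lt;");
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out,
                    Matcher.quoteReplacement("=\"" + value + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String bad = "<a href=\"#Anchor-<ht-42368\">40</a>";
        // prints <a href="#Anchor-&lt;ht-42368">40</a>
        System.out.println(escapeLessThan(bad));
    }
}
```

This would miss single-quoted or unquoted attribute values, and it could also rewrite text outside tags that happens to look like ="...", so it is only a stopgap for input that follows the assumption above.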

TDash
  • In your first example, the href attribute has no closing quote -- is that the way it is in the html you're dealing with? – James Dunn Aug 14 '13 at 14:22
  • Also your result has a hyphen in "ht-42368", but your example does not. – Jason C Aug 14 '13 at 14:24
  • You're right, my example was wrong. I accidentally omitted those when I removed unnecessary code. I updated the example – TDash Aug 14 '13 at 14:35
  • I'd just report a bug to HTMLCleaner and/or try a different HTML parser. There are countless HTML parsers in Java, each with a different degree of leniency and auto-fixing of bad HTML. My favourite, Jsoup, didn't have any problems with the HTML snippet in your question; the bad `<` became a `&lt;`. See also http://stackoverflow.com/questions/3152138/what-are-the-pros-and-cons-of-the-leading-java-html-parsers/3154281#3154281. I would absolutely not fall back to regex in a headless attempt to work around the problem. HTML parsers do not exist without reason. – BalusC Aug 14 '13 at 15:05
  • Thanks for the idea, I tried Jsoup and it's better for what I want to do with it anyway. Cheers. – TDash Aug 14 '13 at 17:20
  • Use a regular expression that's not fooled by the angle bracket: <.*?> – Lonnie Best Sep 29 '13 at 09:25
