I have a Java web server that gets HTML from a REST service. I try to process it with SAXParser, which complains that tags like img or area need to be closed. Unfortunately, I get img tags like this:
<img src="https://..." style="width: 600px; height: 676px;">
This is fine for browsers but not for my parser. I run this on my content before parsing it:
replaceAll("<\\s*([^\\s>]+)([^>]*)/\\s*>", "<$1$2></$1>").replaceAll("<\\s*(img|area)+((\"[^\"]*\"|[^>/])*)(?<!/)\\s*>", "<$1$2></$1>")
The first part converts self-closed tags to "real" closed tags. The second should close unclosed tags like the img or area in my case.
I tested it with some examples.
It seems to work quite well, but if the tag is already closed, it gets closed again:
<area clas="" href=">" > </area> --> <area clas="" href=">" ></area> </area>
I can't understand why right now. Could you help me? Maybe I even need to generalize it a bit more?
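For what it's worth, the double close seems to come from the second pattern only looking at the opening tag: it never checks whether a matching close already follows, so it appends one either way. Below is a minimal sketch of one possible fix, not a definitive implementation. It adds a negative lookahead for the closing tag and makes the attribute group possessive, so the engine cannot backtrack into a quoted `>` to dodge the lookahead. The names `VoidTagCloser` and `closeVoidTags` are hypothetical, and the sketch assumes attribute values are double-quoted.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original code.
class VoidTagCloser {
    private static final Pattern UNCLOSED = Pattern.compile(
        "<\\s*(img|area)\\b"             // tag name, captured as group 1
        + "((?:\"[^\"]*\"|[^>/\"])*+)"   // attributes; possessive, so a quoted '>' is never split
        + "\\s*>"                        // end of the opening tag
        + "(?!\\s*</\\1\\s*>)");         // only match if no closing tag follows directly

    // Appends a closing tag to img/area tags that are neither self-closed nor already closed.
    static String closeVoidTags(String html) {
        return UNCLOSED.matcher(html).replaceAll("<$1$2></$1>");
    }

    public static void main(String[] args) {
        System.out.println(closeVoidTags("<img src=\"https://example.com/a.png\">"));
        System.out.println(closeVoidTags("<area clas=\"\" href=\">\" > </area>"));
    }
}
```

With this sketch the unclosed img gets a `</img>` appended, while the already closed area element is left unchanged. Self-closed tags such as `<img src="a"/>` are not touched either, so the first replaceAll would still be needed for those.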
UPDATE: I know that regex is not the right tool for HTML. However, I need to send this to a piece of code that I'm not allowed to change, which uses an XSL transformation, and there I get SAXParser errors on self-closing and unclosed tags. Is it possible to use jsoup to convert all unclosed or self-closing tags into closed ones and get that as output?
UPDATE: Terrible... Obviously everything works with jsoup.
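import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// Parse the (possibly malformed) HTML into a normalized document tree
Document doc = Jsoup.parse(content);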
// Some additional cleanups
this.parentContent = doc.select("body").html();
And I get my HTML... I was just overcomplicating it :-(