Java close unclosed img tag

Question

I have a Java web server that gets HTML from a REST service. I try to process it with SAXParser, which tells me that tags like img or area need to be closed. Unfortunately, I get img tags like this:

<img src="https://..." style="width: 600px; height: 676px;">

This is fine for browsers but not for my parser. I apply this to my content before parsing it:

replaceAll("<\\s*([^\\s>]+)([^>]*)/\\s*>", "<$1$2></$1>").replaceAll("<\\s*(img|area)+((\"[^\"]*\"|[^>/])*)(?<!/)\\s*>", "<$1$2></$1>")

The first part converts self-closed tags to "real" closed tags. The second should close unclosed tags like the img or area in my case.

I tested it here with some examples: Test Results

It seems to work quite well, but if the tag is already closed, it gets closed again:

<area clas="" href=">" > </area> -->    <area clas="" href=">" ></area> </area>

I can't understand why right now. Could you help me? Maybe I even need to generalize it a bit more?

UPDATE: I know that it's not right to use regex for HTML. However, I need to send this to a piece of code that I'm not allowed to change, which uses an XSL transformation, and there I get SAXParser errors on self-closing and unclosed tags. Is it possible to use jsoup to convert all unclosed or self-closing tags into explicitly closed ones and get that as output?

UPDATE: Terrible... Obviously everything works with jsoup.

Document doc = Jsoup.parse(content);
// Some additional cleanups
this.parentContent = doc.select("body").html();

And I get my HTML... I was just overthinking it :-(

[Don't use regex on XML/HTML!](http://stackoverflow.com/a/1732454/418066) — Biffen, Aug 21 '14 at 08:24
The SAXParser is an XML parser. I would try an HTML parser like Jsoup (http://jsoup.org/) that has the same parsing behaviour as common browsers. — Spectre, Aug 21 '14 at 08:24

score 3 · Accepted Answer · answered Aug 21 '14 at 08:23

HTML and XML are not interchangeable formats, and you might see a whole bunch of different problems pop up if you try to shoehorn one into the other.

I would suggest using an HTML parser (maybe http://jsoup.org/ ) instead of a SAX one in order to parse HTML.
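To make the difference concrete, here is a minimal sketch that uses only the JDK's built-in SAX parser. The class name `WellFormedCheck`, the helper `isWellFormed`, and the sample fragments are made up for illustration; they are not from the question's code base.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    /** Returns true if the fragment parses as XML with the JDK's SAX parser. */
    static boolean isWellFormed(String fragment) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(fragment)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // The input is not well-formed XML (for example, an unclosed <img>).
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what we are testing here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML, but not well-formed XML: <img> is never closed.
        System.out.println(isWellFormed("<p><img src=\"a.png\"></p>"));
        // Self-closed form: well-formed XML.
        System.out.println(isWellFormed("<p><img src=\"a.png\"/></p>"));
    }
}
```

An HTML parser accepts the first fragment because it knows that img is a void element; an XML parser has no such knowledge and only checks that every start tag has a matching end tag.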

Alexander Kjäll

score 0 · Answer 2 · answered Aug 21 '14 at 08:28

Add a lookahead and test whether the tag is already closed. If it is not, apply the closure.

(?=regex1)regex2

Process regex2 only if regex1 matches.
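Applied to the question, the check has to be a negative lookahead placed after the opening tag, so that the tag is only rewritten when no matching closing tag follows. The following is a rough sketch under that assumption, not a general HTML fixer: the class `VoidTagCloser`, the method `closeVoidTags`, and the simplified pattern are illustrative and differ from the pattern in the question. It covers only img and area, leaves self-closed tags alone, and is case-sensitive.

```java
import java.util.regex.Pattern;

class VoidTagCloser {

    // <img ...> or <area ...> whose attributes may contain quoted '>' or '/',
    // NOT already self-closed, and NOT followed by its own closing tag.
    private static final Pattern UNCLOSED = Pattern.compile(
            "<\\s*(img|area)\\b((?:\"[^\"]*\"|'[^']*'|[^>/\"'])*)\\s*>(?!\\s*</\\1\\s*>)");

    /** Appends </img> or </area> to opening tags that have no closing tag yet. */
    static String closeVoidTags(String html) {
        return UNCLOSED.matcher(html).replaceAll("<$1$2></$1>");
    }

    public static void main(String[] args) {
        // Unclosed tag: gets a closing tag appended.
        System.out.println(closeVoidTags("<img src=\"a.png\">"));
        // Already closed: the negative lookahead prevents a second closing tag.
        System.out.println(closeVoidTags("<area clas=\"\" href=\">\" > </area>"));
    }
}
```

The double closing in the question happens because the second pattern matches the opening tag regardless of what follows it; the trailing `(?!\s*</\1\s*>)` is what stops that. Edge cases such as unquoted attribute values containing `/` are still not handled, which is one more reason to prefer a real HTML parser.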

vks
