
I have a Java web server that gets HTML from a REST service. I try to process it with SAXParser, which complains that tags like img or area need to be closed. Unfortunately, I get img tags like this:

<img src="https://..." style="width: 600px; height: 676px;">

That is fine for browsers but not for my parser. I run this on my content before parsing it:

replaceAll("<\\s*([^\\s>]+)([^>]*)/\\s*>", "<$1$2></$1>").replaceAll("<\\s*(img|area)+((\"[^\"]*\"|[^>/])*)(?<!/)\\s*>", "<$1$2></$1>")

The first part converts self-closed tags into "real" closed tags. The second should close unclosed tags like the img or area in my case.

I tested it here with some examples: Test Results

It seems to work quite well, but if the img or area is already closed, it gets closed again:

<area clas="" href=">" > </area> -->    <area clas="" href=">" ></area> </area>

I can't understand why that happens. Could you help me? Maybe I even need to generalize it a bit more?

UPDATE: I know that it's not right to use regex for HTML. However, I need to send this to a piece of code that I'm not allowed to change, which uses an XSL transformation, and there I get SAXParser errors on self-closing and unclosed tags. Is it possible to use jsoup to convert all unclosed or self-closing tags into closed ones and get that as output?

UPDATE: Terrible... Obviously everything works with jsoup.

Document doc = Jsoup.parse(content);
// Some additional cleanups
this.parentContent = doc.select("body").html();

And I get my HTML... I was just overthinking it :-(

Hons

2 Answers


HTML and XML are not interchangeable formats, and you might see a whole bunch of different problems pop up if you try to shoehorn one into the other.

I would suggest using an HTML parser (maybe http://jsoup.org/) instead of a SAX one to parse HTML.

Alexander Kjäll

Add a lookahead that tests whether the tag is already closed. If it is not, apply the closing tag.

(?=regex1)regex2

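regex2 is processed only if regex1 matches at that position. The negated form, (?!regex1)regex2, processes regex2 only if regex1 does not match there.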
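As a rough illustration of the lookahead idea applied to the question's second replaceAll, here is a minimal sketch. It is not the asker's exact expression: the class name VoidTagCloser and the method closeVoidTags are hypothetical, the lookahead is placed after the tag rather than before it, and the attribute group is made possessive so the engine cannot backtrack into a quoted ">" once the lookahead rejects a match. It assumes only img and area need closing and that attribute values containing ">" are double-quoted.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class VoidTagCloser {
    // <\s*(img|area)\b           tag name, captured as group 1
    // ((?:"[^"]*"|[^>/"])*+)     attributes as group 2; double-quoted values may
    //                            contain '>'; possessive, so no backtracking into it
    // >                          end of the start tag
    // (?!\s*</\1\s*>)            negative lookahead: reject the match if the
    //                            matching closing tag already follows
    private static final Pattern UNCLOSED_VOID_TAG = Pattern.compile(
            "<\\s*(img|area)\\b((?:\"[^\"]*\"|[^>/\"])*+)>(?!\\s*</\\1\\s*>)");

    // Appends </img> or </area> to start tags that are not closed yet.
    static String closeVoidTags(String html) {
        return UNCLOSED_VOID_TAG.matcher(html).replaceAll("<$1$2></$1>");
    }

    public static void main(String[] args) {
        // Unclosed tag: a closing tag is appended.
        System.out.println(closeVoidTags("<img src=\"https://example.com/a.png\">"));
        // Already closed tag: the lookahead blocks the replacement.
        System.out.println(closeVoidTags("<area clas=\"\" href=\">\" > </area>"));
    }
}
```

In this sketch, self-closing tags such as <img ... /> are left untouched, since they are already well-formed XML. Single-quoted attribute values containing ">" are not handled, which is one more reason a real HTML parser is the safer route.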

vks