1

I'm using JSoup 1.6.2. I have this HTML ...

 <a title="Subscribe to RSS feeds" href="http://domain/city/RSS" style="float:right; margin-left:10px;""> 

Notice the stray quote right before the end of the tag. I was hoping JSoup could clean that up somehow. I tried to fix it by running ...

final org.jsoup.nodes.Document doc = Jsoup.parse(html);

The result is

  <a title="Subscribe to RSS feeds" href="http://domain/city/RSS" style="float:right; margin-left:10px;" "="">

which is still not well-formed. Is there a way I can take the badly formed HTML and make it well-formed with JSoup? Failing that, is there another HTML tidying tool that can handle the above example and also let me access the resulting HTML as either a String or an org.w3c.dom.Document object?

Dave

2 Answers

0

Can you just use a regular expression replace to fix this? I'm not sure how to do it in Java, but in JavaScript it would be something like this:

var str = '<a title="Subscribe to RSS feeds" href="http://domain/city/RSS" style="float:right; margin-left:10px;"">';

var newStr = str.replace(/""/,'"');
//<a title="Subscribe to RSS feeds" href="http://domain/city/RSS" style="float:right; margin-left:10px;">
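A rough Java equivalent of the replace above, as a minimal sketch using only `java.util.regex`. The class name `StrayQuoteFix` and the method `removeStrayQuote` are made up for illustration. Unlike the JavaScript one-liner, this pattern is deliberately narrower: it only touches a doubled quote that sits directly before the closing `>` of a tag, and it skips legitimate empty attributes such as `alt=""`. It is a workaround for this specific kind of breakage, not a general HTML repair.

```java
import java.util.regex.Pattern;

public class StrayQuoteFix {
    // A doubled quote directly before the end of a tag ('>' or '/>'),
    // but not preceded by '=', so empty attributes like alt="" are left alone.
    private static final Pattern STRAY_QUOTE =
            Pattern.compile("(?<!=)\"\"(\\s*/?>)");

    // Hypothetical helper: collapses the doubled quote into a single one.
    public static String removeStrayQuote(String html) {
        return STRAY_QUOTE.matcher(html).replaceAll("\"$1");
    }

    public static void main(String[] args) {
        String html = "<a title=\"Subscribe to RSS feeds\" "
                + "href=\"http://domain/city/RSS\" "
                + "style=\"float:right; margin-left:10px;\"\">";
        System.out.println(removeStrayQuote(html));
    }
}
```

The cleaned string can then be passed to `Jsoup.parse(html)` as in the question. Other kinds of malformed markup would still need a real tidy tool.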
Bryan Downing
0

Based on this answer, I recommend using JTidy to tidy the HTML source.

vacuum
  • Thanks. JTidy works in my situation. I was looking at their web site, the last release seems to be from 2007. Does that mean the project has died? – Dave Apr 17 '12 at 18:29