HTML attribute cleaning with Java

Question

I have a school assignment to strip every attribute from HTML tags except a few: class, id, alt, src, name and href.

For example, given this HTML file:

<div class="wrapper">
<h1 value="something" class=header>Header</h1>
<div id="article1" class="article" name="something" >
<img clsas="mistake" src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html" title="More">Více</a>
</div>

And the result should be like this:

<div class="wrapper">
<h1 class=header>Header</h1>
<div id="article1" class="article" >
<img src="picture.jpg" id="pict1" class="image_article" alt="picture" />
<p class="article_text" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>
<a href="article.html">Více</a>
</div>

I tried something like this:

String opr = html.replaceAll("<([a-zA-Z]+)[^<>]*(class|id)(=\".+?\")[^<]*(class|id)(=\".+?\")[^<]*>", "<$1 $2$3 $4$5 >");

But it only works on tags that have both a class and an id attribute. Can someone help, please?
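To make the limitation concrete, here is a minimal sketch that applies the pattern above to three opening tags from the example. The class name `RegexAttemptDemo` and the helper `apply` are hypothetical, introduced only for this illustration; the pattern and the replacement string are the ones shown above.

```java
public class RegexAttemptDemo {
    // Pattern and replacement copied from the attempt above, unchanged.
    static final String PATTERN =
        "<([a-zA-Z]+)[^<>]*(class|id)(=\".+?\")[^<]*(class|id)(=\".+?\")[^<]*>";
    static final String REPLACEMENT = "<$1 $2$3 $4$5 >";

    // Hypothetical helper: runs the replacement on one piece of HTML.
    static String apply(String html) {
        return html.replaceAll(PATTERN, REPLACEMENT);
    }

    public static void main(String[] args) {
        // No class or id at all: the pattern does not match, so title survives.
        System.out.println(apply("<a href=\"article.html\" title=\"More\">"));
        // prints: <a href="article.html" title="More">

        // Both id and class present: the tag is rewritten as intended.
        System.out.println(apply(
            "<div id=\"article1\" class=\"article\" name=\"something\" >"));
        // prints: <div id="article1" class="article" >

        // Both present, but src and alt are lost because only two
        // attributes are ever copied into the replacement.
        System.out.println(apply(
            "<img clsas=\"mistake\" src=\"picture.jpg\" id=\"pict1\" "
            + "class=\"image_article\" alt=\"picture\" />"));
        // prints: <img id="pict1" class="image_article" >
    }
}
```

So besides skipping tags without both attributes, the pattern also discards wanted attributes such as src and alt whenever it does match.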

Did they tell you to do this with `regex`? If so, tell them the idea is plain stupid, because that's not what `regex` is meant for. Use a proper HTML parser to manipulate the HTML. — SomeJavaGuy, Jun 02 '16 at 14:52
Even if it's just an example, show them [the accepted answer to the question](http://stackoverflow.com/questions/677038/how-to-use-regular-expressions-to-parse-html-in-java); regex is not the tool for manipulating `html`. — SomeJavaGuy, Jun 02 '16 at 14:55

Nicolas Filotto · Accepted Answer · 2016-06-03T08:35:11.753

Avoid regular expressions for this: a pattern that handles every case correctly would be very complex and therefore hard to maintain. Use an HTML parser such as Jsoup instead, then clean up each element by removing all the unwanted attributes, as follows:

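import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse("<html>\n" +
    " <head></head>\n" +
    " <body>\n" +
    "<table><div class=\"wrapper\">\n" +
    "<h1 value=\"something\" class=header>Header</h1>\n" +
    "<div id=\"article1\" class=\"article\" name=\"something\" >\n" +
    "<img clsas=\"mistake\" src=\"picture.jpg\" id=\"pict1\" class=\"image_article\" alt=\"picture\" />\n" +
    "<p class=\"article_text\" >Lorem ipsum dolor sit amet, consectetur adipiscing. </p>\n" +
    "<a href=\"article.html\" title=\"More\">Více</a>\n" +
    "</div></body></html>"
);
for (Element element : doc.getAllElements()) {
    // Collect the unwanted keys first, so the attribute collection
    // is never modified while it is being iterated.
    List<String> unwanted = new ArrayList<>();
    for (Attribute attribute : element.attributes()) {
        switch (attribute.getKey()) {
            case "class":
            case "id":
            case "alt":
            case "src":
            case "name":
            case "href":
                break; // allowed attribute, keep it
            default:
                unwanted.add(attribute.getKey());
        }
    }
    for (String key : unwanted) {
        element.removeAttr(key);
    }
}
System.out.println(doc);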

Output:

<html>
 <head></head> 
 <body> 
  <div class="wrapper"> 
   <h1 class="header">Header</h1> 
   <div id="article1" class="article" name="something"> 
    <img src="picture.jpg" id="pict1" class="image_article" alt="picture"> 
    <p class="article_text">Lorem ipsum dolor sit amet, consectetur adipiscing. </p> 
    <a href="article.html">Více</a> 
   </div>
  </div>
  <table></table>
 </body>
</html>
