I am trying to parse an HTML document using Jsoup, and I want to allow the <table> tag but not <tbody>.

I have seen this link:

Jsoup parsing an Html file with a tbody tag

and I tried the following:

Whitelist whiteList = Whitelist.relaxed();

whiteList.addTags("table");
whiteList.addTags("font");

whiteList.addAttributes("table", "align");
whiteList.addAttributes("tr","align");

//whiteList.removeTags("tbody");

String html = "<table>"
    + "<tr align='top'>"
    + "<th><font>Link</th>"
    + "</tr>"
    + "</table>";

boolean valid = Jsoup.isValid(html, whiteList);

System.out.println(valid);

If I uncomment the removeTags("tbody") line, isValid returns false.

Also changing it to:

Document document = Jsoup.parse(html,"",Parser.xmlParser());

doesn't have much of an effect.

Is there any workaround for this?

I want to allow <table> but not allow <tbody>.

PS: I have thought of checking for <tbody> before parsing, but that does not feel like a very good solution.

Abi

1 Answer

Simple answer: You can't do it with Jsoup right now.

Explanation:

According to the specs that the Jsoup parser (DOM builder) follows, it creates a tbody element automatically upon reading a tr that is inside a table but not yet inside a tbody.

Here is the relevant section of the documentation:

tr = table . insertRow( [ index ] )

Creates a tr element, along with a tbody if required, inserts them into the table at the position given by the argument, and returns the tr.

It follows that Jsoup will internally create the tbody element for HTML content. If you use the XML parser, that tbody element is not created. Unfortunately, the Whitelist feature only works on HTML, not on XML. You may file a request for an XML implementation of the whitelist feature on the Jsoup issue list.
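
Since the tbody that the HTML parser inserts cannot be told apart from one written in the source, the pre-parse check mentioned in the question is probably the most practical workaround for now. Below is a minimal sketch of that check, using only the Java standard library. The class TbodyPrecheck and the method containsExplicitTbody are hypothetical names chosen for illustration and are not part of Jsoup. The regular expression is a rough test: it would also match a tbody tag that appears inside a comment or a script block.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: inspects the raw markup before parsing.
final class TbodyPrecheck {

    // Matches an opening or closing tbody tag written out in the source,
    // e.g. <tbody>, <TBODY class="x"> or </tbody>.
    private static final Pattern EXPLICIT_TBODY =
            Pattern.compile("</?tbody\\b", Pattern.CASE_INSENSITIVE);

    // Returns true if the raw source contains a literal tbody tag.
    static boolean containsExplicitTbody(String html) {
        return html != null && EXPLICIT_TBODY.matcher(html).find();
    }

    public static void main(String[] args) {
        String implicit = "<table><tr><th>Link</th></tr></table>";
        String explicit = "<table><tbody><tr><th>Link</th></tr></tbody></table>";

        System.out.println(containsExplicitTbody(implicit)); // false
        System.out.println(containsExplicitTbody(explicit)); // true
    }
}
```

A document would then be accepted only if this check returns false and Jsoup.isValid(html, whiteList) returns true, with tbody left in the whitelist so that the automatically created element does not fail validation.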

luksch