I am trying to parse a table that is stored in an XML file as escaped HTML and generate a Word document from it. The table structure and its content should be reproduced automatically in the Word document. I am working in Java with the help of the Apache POI library. When I retrieve the values from the XML, I don't see the HTML tags associated with the table structure, and without those tags I cannot create the corresponding table in the Word document. How should I proceed?

The XML that I am parsing has one field with values that are arranged in a table structure.

<customfield id="9999" key="com.atlassian.jira.plugin.system.customfieldtypes:textarea">
  <customfieldname>Product</customfieldname>
       <customfieldvalues>
          <customfieldvalue>
    &lt;div class=&apos;table-wrap&apos;&gt;
    &lt;table class=&apos;conTable&apos;&gt;&lt;tbody&gt;
    &lt;tr&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product1:&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product2:&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product3;/li&gt;
        &lt;li&gt;Product4&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;td class=&apos;confluenceTd&apos;&gt;&lt;ul&gt;
        &lt;li&gt;Product5&lt;/li&gt;
        &lt;li&gt;Product6&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/td&gt;
    &lt;/tr&gt;
    &lt;/tbody&gt;&lt;/table&gt;
    &lt;/div&gt;
         </customfieldvalue>
     </customfieldvalues>
  </customfield>

The corresponding HTML is as follows:

> <customfieldvalues>
>     <customfieldvalue> <div class='table-wrap'> <table class='confluenceTable'><tbody> <tr> <td class='confluenceTd'><ul>
> <li>Product1:</li> </ul> </td> <td class='confluenceTd'><ul>
> <li>Product2:</li> </ul> </td> </tr> <tr> <td
> class='confluenceTd'><ul> <li>Product3</li> <li>Product4</li> </ul>
> </td> <td class='confluenceTd'><ul> <li>Product5</li>
> <li>Product6</li> </ul> </td> </tr> </tbody></table> </div>    
> </customfieldvalue> </customfieldvalues>

I parse the XML in the usual way to retrieve the value:

element.item(n).getChildNodes().item(0).getNodeValue()
Jeet
  • Does this answer your question? [How to unescape HTML character entities in Java?](https://stackoverflow.com/questions/994331/how-to-unescape-html-character-entities-in-java) For example, this will show you how to convert a string containing `&lt;div class=&apos;table-wrap&apos;&gt;` to a string containing `<div class='table-wrap'>`, and so on. – andrewJames Feb 10 '23 at 13:47
  • After using htmlUnescape(source string), when I call string.contains("") or string.contains(""), why is it always false? How can I retrieve the tags after applying htmlUnescape to the string? – Jeet Feb 10 '23 at 14:02
  • After unescaping HTML characters, you still end up with a string, not a HTML document. If you want to parse that string as HTML, you can use a tool which is designed for that, such as [JSoup](https://stackoverflow.com/q/1497946/12567365) or other similar libraries. – andrewJames Feb 10 '23 at 14:07
  • In fact (I didn't realize this before) Jsoup can handle the unescaping for you, also. – andrewJames Feb 10 '23 at 14:16

1 Answer

Here is a basic demo using Jsoup.

It assumes you have already extracted the text content from your <customfieldvalue>...</customfieldvalue> element.

So, now you have a string containing:

&lt;div class=&apos;table-wrap&apos;&gt; ... &lt;/div&gt;
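For reference, the extraction step assumed above could be sketched with the JDK's built-in DOM parser, as shown here. The class name `CustomFieldReader` and the method `readFirstValue` are hypothetical, and the inline XML is a shortened stand-in for the real export. One thing worth knowing: an XML parser decodes the predefined entities (`&lt;`, `&gt;`, `&apos;`) while reading, so depending on how the value was obtained, the string may already contain literal tags. In that case the unescaping step should simply leave it unchanged.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper class, for illustration only.
class CustomFieldReader {

    // Returns the trimmed text of the first <customfieldvalue> element,
    // or an empty string if the document has no such element.
    static String readFirstValue(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList values = doc.getElementsByTagName("customfieldvalue");
        if (values.getLength() == 0) {
            return "";
        }
        // getTextContent() joins all text children, which avoids relying on
        // the value sitting entirely in the first child node.
        return values.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample data; the real field holds the full table markup.
        String xml = "<customfieldvalues><customfieldvalue>"
                + "&lt;table class=&apos;conTable&apos;&gt;&lt;tr&gt;"
                + "&lt;td&gt;Product1&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;"
                + "</customfieldvalue></customfieldvalues>";
        // prints: <table class='conTable'><tr><td>Product1</td></tr></table>
        System.out.println(readFirstValue(xml));
    }
}
```

Using `getTextContent()` rather than `getChildNodes().item(0).getNodeValue()` is a design choice here: it still works if the parser splits the text into several nodes.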

To extract that content as a HTML document using Jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

boolean strictMode = true;
String unescapedString = Parser.unescapeEntities(escapedString, strictMode);
Element body = Jsoup.parse(unescapedString).body();

You can iterate through all the child elements of this containing element:

for (Element element : Jsoup.parse(unescapedString).body().children().select("*")) {
    System.out.println(element.nodeName() + " - " + element.ownText());
}

In this case, all I am doing is printing each element with any data it contains.

The output is:

div - 
table - 
tbody - 
tr - 
td - 
ul - 
li - Product1:
td - 
ul - 
li - Product2:
tr - 
td - 
ul - 
li - Product3;/li>
li - Product4
td - 
ul - 
li - Product5
li - Product6

Interestingly, you can see that there is some malformed escaped HTML in the original data:

&lt;li&gt;Product3;/li&gt;

Once you have full access to the data as HTML, you can build your Word table using POI in the usual way.
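As a rough illustration of that last step, the sketch below reads each `<tr>` and `<td>` with Jsoup and copies the cell text into an `XWPFTable`. It assumes Jsoup and Apache POI (the `poi-ooxml` artifact) are on the classpath. The class name `HtmlTableToWord`, its two helper methods, and the output file name `products.docx` are hypothetical. List items inside a cell are flattened to plain text here rather than rendered as Word bullets.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
class HtmlTableToWord {

    // Collects the text of every <td> in every <tr>, row by row.
    // Assumes a single, non-nested table in the HTML fragment.
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        for (Element tr : Jsoup.parse(html).select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element td : tr.select("td")) {
                cells.add(td.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    // Builds a Word document containing one table sized to fit the rows.
    static XWPFDocument buildDocument(List<List<String>> rows) {
        XWPFDocument doc = new XWPFDocument();
        int cols = rows.stream().mapToInt(List::size).max().orElse(0);
        if (rows.isEmpty() || cols == 0) {
            return doc; // nothing to render
        }
        XWPFTable table = doc.createTable(rows.size(), cols);
        for (int r = 0; r < rows.size(); r++) {
            List<String> cells = rows.get(r);
            for (int c = 0; c < cells.size(); c++) {
                table.getRow(r).getCell(c).setText(cells.get(c));
            }
        }
        return doc;
    }

    public static void main(String[] args) throws IOException {
        // Shortened, already-unescaped sample of the table markup.
        String html = "<table><tbody>"
                + "<tr><td><ul><li>Product1:</li></ul></td>"
                + "<td><ul><li>Product2:</li></ul></td></tr>"
                + "<tr><td><ul><li>Product3</li><li>Product4</li></ul></td>"
                + "<td><ul><li>Product5</li><li>Product6</li></ul></td></tr>"
                + "</tbody></table>";
        try (XWPFDocument doc = buildDocument(extractRows(html));
             OutputStream out = new FileOutputStream("products.docx")) {
            doc.write(out);
        }
    }
}
```

Keeping the HTML extraction separate from the POI code means the row data can be inspected, or cleaned up where the source markup is malformed, before anything is written to the document.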

andrewJames