Parsing HTML content from XML file

Question

    <xbrli:xbrl xmlns:aoi="http://www.aointl.com/20160331" xmlns:country="http://xbrl.sec.gov/country/2016-01-31" xmlns:currency="http://xbrl.sec.gov/currency/2016-01-31" xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31" xmlns:exch="http://xbrl.sec.gov/exch/2016-01-31" xmlns:invest="http://xbrl.sec.gov/invest/2013-01-31" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:naics="http://xbrl.sec.gov/naics/2011-01-31" xmlns:nonnum="http://www.xbrl.org/dtr/type/non-numeric" xmlns:num="http://www.xbrl.org/dtr/type/numeric" xmlns:ref="http://www.xbrl.org/2006/ref" xmlns:sic="http://xbrl.sec.gov/sic/2011-01-31" xmlns:stpr="http://xbrl.sec.gov/stpr/2011-01-31" xmlns:us-gaap="http://fasb.org/us-gaap/2016-01-31" xmlns:us-roles="http://fasb.org/us-roles/2016-01-31" xmlns:us-types="http://fasb.org/us-types/2016-01-31" xmlns:utreg="http://www.xbrl.org/2009/utr" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xbrldt="http://xbrl.org/2005/xbrldt" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <link:schemaRef xlink:href="aoi-20160331.xsd" xlink:type="simple"/>
    <xbrli:context id="FD2016Q4YTD">
    <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0000939930</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
    <xbrli:startDate>2015-04-01</xbrli:startDate>
    <xbrli:endDate>2016-03-31</xbrli:endDate>
    </xbrli:period>
    </xbrli:context>

    <aoi:OtherIncomeAndExpensePolicyTextBlock contextRef="FD2016Q4YTD" id="Fact-F51C7616E17E5B8B0B770D410BBF5A3E">
    <div style="font-family:Times New Roman;font-size:10pt;"><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">Other Income (Expense)</font></div><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"></font></div></div>
    </aoi:OtherIncomeAndExpensePolicyTextBlock>
    </xbrli:xbrl>

This is my XML (XBRL) input, which I need to parse. I don't know whether it is valid or not, but I need output like this:

    <div style="font-family:Times New Roman;font-size:10pt;"><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">Other Income (Expense)</font></div><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"></font></div></div>

Could someone please share how to solve this? I have been stuck on this problem for the last two weeks.

This is the code I am using:

    File fXmlFile = new File("/home/devteam-user1/Desktop/ky/UnitTesting.xml");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);

    XPath xPath = XPathFactory.newInstance().newXPath();
    final String DIV_UNDER_ROOT = "/*/aoi";
    NodeList divList = (NodeList) xPath.compile(DIV_UNDER_ROOT)
            .evaluate(doc, XPathConstants.NODESET);
    System.out.println(divList.getLength());
    for (int i = 0; i < divList.getLength(); i++) {  // just in case there is more than one
        Node divNode = divList.item(i);
        System.out.println(nodeToString(divNode));
    }

The nodeToString method is below:

    private static String nodeToString(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StreamResult result = new StreamResult(new StringWriter());
        transformer.transform(new DOMSource(node), result);
        return result.getWriter().toString();
    }

I don't fully understand the question, but if you need to embed HTML inside XML you should escape the characters. For example, <b>Hello World</b> would be written as &lt;b&gt;Hello World&lt;/b&gt;. Or use a <![CDATA[ ]]> block. – Marco A. Hernandez, Jul 18 '16 at 11:38
@marco I don't need to insert HTML into XML; it is already there in the XML. I need to get that HTML content using any Java API. My question clearly states my input and output. – John Adam, Jul 18 '16 at 11:44
Use an XML parser to extract the XML information by XML tag. Keep the HTML. – Gilbert Le Blanc, Jul 18 '16 at 11:52
But is your XML document as a whole well formed? No missing end tag in the HTML part? – vanje, Jul 18 '16 at 11:54
@Gilbert I have tried many APIs and parsers. If you know how to parse the XML above, please share the code. – John Adam, Jul 18 '16 at 11:55
As long as your XML is well formed you can use a SAX parser, a DOM parser, JAXB, etc. But you must ensure that it is well formed, and the easiest way, if you have HTML code, is escaping the text. – Marco A. Hernandez, Jul 18 '16 at 11:56
@marco I am not able to get that HTML content with any of the APIs you just mentioned. Could you share the code if possible? – John Adam, Jul 18 '16 at 11:58
I second Marco A. Hernandez. I don't see the problem with parsing the XML and extracting the parts you are interested in. Maybe you should show your code and explain in more detail what your exact problem is. – vanje, Jul 18 '16 at 11:59
@all I have added my code above. It returns the HTML content only when the XML has a single tag; if more than one tag is present I get an exception. – John Adam, Jul 18 '16 at 13:31
You can parse the XML here and just grab the node you require. The HTML is still well formed XML. – ManoDestra, Jul 18 '16 at 13:31
Did you try jsoup? It's the best HTML/XML parser for Java (https://jsoup.org/), and it's available in Maven too. – wutzebaer, Jul 18 '16 at 13:35
@all If you have any code that does the work, please share it here. – John Adam, Jul 18 '16 at 13:42
This is part 2 of another question from the same person. I gave a full answer there, so he copy-pasted my answer into a new question. Is this proper behavior in this forum? http://stackoverflow.com/questions/38366988/xmlxbrl-tags-having-html-content-how-to-parse-it/38418677#38418677 – Sharon Ben Asher, Jul 18 '16 at 13:51

score 0 · Answer 1 · answered Jul 18 '16 at 13:50

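This works well for me:

    public static void main(String[] args) throws IOException {
        FileInputStream fis = new FileInputStream("yourfile.xml");
        // Utils.streamToString is the answerer's own helper that reads a stream into a String;
        // Document here is org.jsoup.nodes.Document
        Document doc = Jsoup.parse(Utils.streamToString(fis));
        // html() already returns a String holding the inner HTML of the matched element
        System.out.println(doc.select("aoi|OtherIncomeAndExpensePolicyTextBlock").html());
    }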

wutzebaer

@sharonbn I hope we can learn here from knowledgeable people like you. You told me it is not well-formed XML. I am also new to this forum; as I understood it, if my question is posed properly I will get a solution more easily. I am not doing anything apart from that. Thanks – John Adam Jul 18 '16 at 13:55
@JohnAdam DO NOT EDIT answers to your question! DO NOT OPEN NEW QUESTIONS that copy-paste answers from a previous question! This is not how to behave toward people trying to help you! – Sharon Ben Asher Jul 18 '16 at 14:00
It is really just basic courtesy to give credit when you copy-paste from someone else's answer. – Sharon Ben Asher Jul 18 '16 at 14:00
@wutzebaer Utils? doc.select? Can you explain more on this? – John Adam Jul 19 '16 at 10:24
Utils.streamToString is a utility function of mine which converts a stream into a string. For doc.select, see here: https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#select-java.lang.String- – wutzebaer Jul 19 '16 at 10:58
@wutzebaer Thank you so much for your code. It's working fine. – John Adam Jul 20 '16 at 11:10
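
The answer does not show Utils.streamToString itself. A minimal sketch of what such a helper might look like follows; the class name StreamUtils and the UTF-8 encoding are assumptions made for illustration, not the answerer's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the answerer's Utils class (not part of jsoup).
public class StreamUtils {

    // Reads the whole stream into memory and decodes it as UTF-8 (assumed encoding).
    static String streamToString(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("<a>hi</a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(streamToString(in));
    }
}
```

If the file is not UTF-8, the charset passed to the String constructor would need to match the encoding declared in the XML.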

score 0 · Answer 2 · edited May 23 '17 at 12:14

Your main issue lies with

    final String DIV_UNDER_ROOT = "/*/aoi";

This is an XPath expression that matches "any element two levels deep (a child of the root element) whose local name is aoi and which has no namespace". This is not what you want.

You want to match the contents of an element that is two levels deep, whose namespace prefix is "aoi" (which means it belongs to the "http://www.aointl.com/20160331" namespace), and whose local name is "OtherIncomeAndExpensePolicyTextBlock".

Matching namespaces in XPath in Java is quite cumbersome (see XPath with namespace in Java and How to query XML using namespaces in Java with XPath?), but long story short, you could try this instead:

    final String DIV_UNDER_ROOT = "//*[local-name()='OtherIncomeAndExpensePolicyTextBlock' and namespace-uri()='http://www.aointl.com/20160331']/*";

This will only work if your DocumentBuilderFactory is namespace aware, so make sure to configure it before calling newDocumentBuilder():

    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    dbFactory.setNamespaceAware(true);
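
Putting the pieces of this answer together, a self-contained sketch might look like the following. The class name XbrlHtmlExtractor, the method name extractHtml, and the shortened sample XML in main are made up for illustration; the XPath expression and the namespace-aware factory are the ones described above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch, not a drop-in XBRL library.
public class XbrlHtmlExtractor {

    // Child elements of the aoi text block, matched by local name and namespace URI.
    static final String DIV_UNDER_TEXT_BLOCK =
        "//*[local-name()='OtherIncomeAndExpensePolicyTextBlock'"
        + " and namespace-uri()='http://www.aointl.com/20160331']/*";

    static String extractHtml(String xml) throws Exception {
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        dbFactory.setNamespaceAware(true); // required for namespace-uri() to match
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList divList = (NodeList) xPath.evaluate(
            DIV_UNDER_TEXT_BLOCK, doc, XPathConstants.NODESET);

        // Identity transformer; skip the <?xml ...?> header on each fragment.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringBuilder html = new StringBuilder();
        for (int i = 0; i < divList.getLength(); i++) {
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(divList.item(i)), new StreamResult(out));
            html.append(out);
        }
        return html.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample input (assumption): same structure as the question's XML.
        String xml =
            "<xbrli:xbrl xmlns:xbrli=\"http://www.xbrl.org/2003/instance\""
            + " xmlns:aoi=\"http://www.aointl.com/20160331\">"
            + "<aoi:OtherIncomeAndExpensePolicyTextBlock contextRef=\"FD2016Q4YTD\">"
            + "<div><font>Other Income (Expense)</font></div>"
            + "</aoi:OtherIncomeAndExpensePolicyTextBlock>"
            + "</xbrli:xbrl>";
        System.out.println(extractHtml(xml));
    }
}
```

Note that the serializer uses the XML output method by default, so empty elements such as the second font tag in the question may come out in self-closing form rather than as an open/close pair.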

OP should invest the time to learn the XPath tooling and syntax. That is all. – Sharon Ben Asher, Jul 18 '16 at 14:04
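
For completeness, the other route hinted at in the answer (prefixed names in the XPath itself) requires registering a NamespaceContext on the XPath object. A rough sketch, assuming only the aoi prefix needs to be resolved; the class and method names are hypothetical:

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; illustrates a prefix-based XPath query.
public class AoiNamespaceXPath {

    static final String AOI_NS = "http://www.aointl.com/20160331";

    // Minimal context that only knows the "aoi" prefix (an assumption for this sketch).
    static final NamespaceContext AOI_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "aoi".equals(prefix) ? AOI_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return AOI_NS.equals(uri) ? "aoi" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    };

    // Returns the child elements of the aoi text block under the root element.
    static NodeList selectTextBlockChildren(String xml) throws Exception {
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        dbFactory.setNamespaceAware(true);
        Document doc = dbFactory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(AOI_CONTEXT); // lets the expression use the aoi: prefix
        return (NodeList) xPath.evaluate(
            "/*/aoi:OtherIncomeAndExpensePolicyTextBlock/*", doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<xbrli:xbrl xmlns:xbrli=\"http://www.xbrl.org/2003/instance\""
            + " xmlns:aoi=\"http://www.aointl.com/20160331\">"
            + "<aoi:OtherIncomeAndExpensePolicyTextBlock><div>text</div>"
            + "</aoi:OtherIncomeAndExpensePolicyTextBlock></xbrli:xbrl>";
        System.out.println(selectTextBlockChildren(xml).getLength());
    }
}
```

The prefix used in the expression is resolved through the registered context, not through the document, so it only has to map to the same namespace URI as the one declared in the XML.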
