I am parsing an XML file. The code works for some files but fails for others.
My code is:
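import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public static String parseXml(String xmlFileName) {
    StringBuilder docText = new StringBuilder();
    try {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        //domFactory.setValidating(false);
        DocumentBuilder builder = domFactory.newDocumentBuilder();
        // Replace the external DTD with an empty document so it is never loaded.
        builder.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId)
                    throws SAXException, IOException {
                if (systemId.contains("pdf2xml.dtd")) {
                    return new InputSource(
                            new ByteArrayInputStream("<?xml version='1.0' encoding='UTF-8'?>".getBytes()));
                } else {
                    return null;
                }
            }
        });
        System.out.println("File is : " + xmlFileName);
        Document doc = builder.parse(new FileInputStream(xmlFileName));
        System.out.println("root of xml file" + doc.getDocumentElement().getNodeName());
        NodeList nodes = doc.getElementsByTagName("text");
        /**Do Something here*/
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // The catch block and return statement were cut from the snippet;
        // this is an assumed minimal version so the method compiles.
        e.printStackTrace();
    }
    return docText.toString();
}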
I tried disabling validation with domFactory.setValidating(false);
but that does not help.
I checked the XML file and it looks fine to me, with all tags properly closed (though I am new to XML).
Stack trace:
[Fatal Error] :210:67: XML document structures must start and end within the same entity.
org.xml.sax.SAXParseException; lineNumber: 210; columnNumber: 67; XML document structures must start and end within the same entity.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
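The position in the message (line 210, column 67) can also be read from the exception itself, which makes it easier to log alongside the file name. The following is only a minimal sketch under assumptions: the class name ParseErrorLocation and the helper locateError are hypothetical, and it parses an in-memory string instead of the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ParseErrorLocation {
    // Hypothetical helper: returns "line:column" of the first fatal parse
    // error, or "ok" when the input is well-formed.
    static String locateError(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler ignores warnings and recoverable errors and rethrows
        // fatal errors, so the failure arrives only as an exception.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return "ok";
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        // A complete document, then one cut off before its closing tag.
        System.out.println(locateError("<a><b>text</b></a>"));
        System.out.println(locateError("<a><b>text</b>"));
    }
}
```

The second call mimics a document that ends before its root element is closed, which is one situation in which a parser reports that structures must start and end within the same entity.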
Here is the XML content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.34.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="13" family="Times" color="#4c4c4c"/>
<fontspec id="1" size="34" family="Times" color="#4c4c4c"/>
<fontspec id="2" size="28" family="Times" color="#4c4c4c"/>
<fontspec id="3" size="19" family="Times" color="#4c4c4c"/>
<fontspec id="4" size="16" family="Times" color="#4c4c4c"/>
<image top="52" left="45" width="828" height="461" src="./app/utils/resume/pdf/DISC-Aditya_Thakur-1_1.jpg"/>
<image top="1009" left="45" width="225" height="96" src="./app/utils/resume/pdf/DISC-Aditya_Thakur-1_2.jpg"/>
<text top="1140" left="45" width="415" height="15" font="0">Copyright 2016 Innermetrix Incorporated • All rights reserved</text>
<text top="416" left="45" width="243" height="36" font="1"><b>Aditya Thakur</b></text>
<text top="457" left="45" width="179" height="30" font="2">May 25, 2016</text>
<text top="551" left="45" width="747" height="21" font="3">This Innermetrix Disc Index is a modern interpretation of Dr. William Marston's</text>
<text top="578" left="45" width="770" height="21" font="3">behavioral dimensions. Marston's research uncovered four quadrants of behavior</text>
<text top="606" left="45" width="809" height="21" font="3">which help to understand a person's behavioral preferences. This Disc Index will help</text>
<text top="633" left="45" width="703" height="21" font="3">you understand your behavioral style and how to maximize your potential.</text>
<text top="1027" left="293" width="217" height="18" font="4">Anthony Robbins Coaching</text>
<text top="1055" left="293" width="183" height="18" font="4">www.tonyrobbins.com</text>
<text top="1084" left="293" width="5" height="18" font="4"> </text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="5" size="22" family="Times" color="#1c8cc4"/>
<fontspec id="6" size="22" family="Times" color="#303030"/>
<fontspec id="7" size="13" family="Times" color="#000000"/>
<fontspec id="8" size="25" family="Times" color="#4c4c4c"/>
<fontspec id="9" size="14" family="Times" color="#7f7f7f"/>
<fontspec id="10" size="40" family="Times" color="#ffffff"/>
<fontspec id="11" size="16" family="Times" color="#000000"/>
<fontspec id="12" size="14" family="Times" color="#4c4c4c"/>
<fontspec id="13" size="14" family="Times" color="#4c4c4c"/>
<image top="18" left="30" width="83" height="83" src="./app/utils/resume/pdf/DISC-Aditya_Thakur-2_1.png"/>
<text top="46" left="128" width="174" height="24" font="5"><b>The DISC Index</b></text>
<text top="46" left="317" width="228" height="24" font="6"><b>Executive Summary</b></text>
<text top="551" left="891" width="0" height="15" font="7">Aditya Thakur</text>
<text top="1140" left="45" width="415" height="15" font="0">Copyright 2016 Innermetrix Incorporated • All rights reserved</text>
<text top="1140" left="865" width="8" height="15" font="7">2</text>
<text top="152" left="196" width="526" height="27" font="8"><b>Natural and Adaptive Styles Comparison</b></text>
<text top="590" left="44" width="26" height="17" font="9"> 0</text>
<text top="556" left="44" width="27" height="17" font="9"> 10</text>
<text top="521" left="44" width="27" height="17" font="9"> 20</text>
<text top="487" left="44" width="27" height="17" font="9"> 30</text>
<text top="452" left="44" width="27" height="17" font="9"> 40</text>
<text top="418" left="44" width="27" height="17" font="9"> 50</text>
<text top="383" left="44" width="27" height="17" font="9"> 60</text>
<text top="349" left="44" width="27" height="17" font="9"> 70</text>
<text top="314" left="44" width="27" height="17" font="9"> 80</text>
<text top="280" left="44" width="27" height="17" font="9"> 90</text>
<text top="245" left="44" width="27" height="17" font="9">100</text>
<text top="620" left="156" width="29" height="42" font="10"><b>D</b></text>
<text top="675" left="147" width="56" height="18" font="11">56 / 77</text>
<text top="620" left="359" width="16" height="42" font="10"><b>I</b></text>
<text top="675" left="343" width="56" height="18" font="11">53 / 67</text>
<text top="620" left="552" width="22" height="42" font="10"><b>S</b></text>
//****************Line no 200 starts*****************//
<text top="618" left="320" width="72" height="18" font="11">Inspiring</text>
<text top="652" left="307" width="97" height="18" font="11">Enthusiastic</text>
<text top="686" left="322" width="67" height="18" font="11">Sociable</text>
<text top="721" left="329" width="54" height="18" font="11">Poised</text>
<text top="755" left="316" width="79" height="18" font="11">Charming</text>
<text top="789" left="311" width="89" height="18" font="11">Convincing</text>
<text top="823" left="317" width="78" height="18" font="11">Reflective</text>
<text top="857" left="299" width="112" height="18" font="11">Matter-of-fact</text>
<text top="892" left="311" width="88" height="18" font="11">Withdrawn</text>
<text top="926" left="333" width="46" height="18" font="14"><b>Aloof</b></text>
<text top="999" left="328" width="54" height="21" font="19"><b>Low I</b></text>
<text top="242" left="495" width="134" height="27" font="16"><b>Stabilizing</b></text>
<text top="305" left="536" width="53" height="21" font="17"><b>Pace:</b></text>
<text top="351" left="474" width="176" height="18" font="11">How you tend to pace</text>
<text top="373" left="507" width="111" height="18" font="11">things in your</text>
<text top="394" left="510" width="104" height="18" font="11">environment</text>
<text top="462" left="531" width="63" height="21" font="20"><b>High S</b></text>
<text top="550" left="531" width="63" height="18" font="14"><b>Patient</b></text>
<text top="584" left="517" width="91" height="18" font="11">Predictable</text>
<text top="618" left="533" width="59" height="18" font="11">Passive</text>
<text top="652" left="514" width="97" height="18" font="11">Complacent</text>
//****************Line no 210 ends*****************//
//**********Last 5 lines*************************//
<text top="778" left="45" width="792" height="18" font="11">___________________________________________________________________________________________________________</text>
<text top="806" left="45" width="792" height="18" font="11">___________________________________________________________________________________________________________</text>
<text top="835" left="45" width="792" height="18" font="11">___________________________________________________________________________________________________________</text>
</page>
</pdf2xml>
Line 210 is: <text top="999" left="328" width="54" height="21" font="19"><b>Low I</b></text>
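To double-check which text the parser sees at the reported line, the line can be read back from the file on disk at the moment parsing fails. This is a rough sketch only: the class name ShowLine and the helper lineAt are hypothetical, and it assumes the file is UTF-8 as its XML declaration says.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ShowLine {
    // Hypothetical helper: returns the 1-based line `lineNumber` of the file,
    // or null when the file has fewer lines than that.
    static String lineAt(Path file, int lineNumber) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        if (lineNumber < 1 || lineNumber > lines.size()) {
            return null;
        }
        return lines.get(lineNumber - 1);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: ShowLine <file> <lineNumber>");
            return;
        }
        Path file = Paths.get(args[0]);
        int lineNumber = Integer.parseInt(args[1]);
        String line = lineAt(file, lineNumber);
        // Print the line and its length so it can be compared with the
        // column number in the parser's error message.
        System.out.println(line);
        System.out.println("length: " + (line == null ? 0 : line.length()));
    }
}
```

Running it with the XML file name and 210 as arguments prints that line and its length, which can then be compared with column 67 from the error message.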
Thanks in advance.