
I am parsing XML files. Parsing works for some files but fails for others.

My code is:

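import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public static String parseXml(String xmlFileName) {
    StringBuilder docText = new StringBuilder();

    try {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        //domFactory.setValidating(false);
        DocumentBuilder builder = domFactory.newDocumentBuilder();

        // Substitute an empty stub for pdf2xml.dtd so the parser does not try to load it.
        builder.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId)
                    throws SAXException, IOException {
                if (systemId.contains("pdf2xml.dtd")) {
                    return new InputSource(
                            new ByteArrayInputStream("<?xml version='1.0' encoding='UTF-8'?>".getBytes()));
                } else
                    return null;
            }
        });
        System.out.println("File is : " + xmlFileName);
        Document doc = builder.parse(new FileInputStream(xmlFileName));
        System.out.println("root of xml file" + doc.getDocumentElement().getNodeName());
        NodeList nodes = doc.getElementsByTagName("text");
        /**Do Something here*/
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // The SAXParseException shown below is thrown by builder.parse(...)
        e.printStackTrace();
    }
    return docText.toString();
}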

I tried disabling validation with domFactory.setValidating(false); but that did not help.

I checked the XML file, and it looks fine to me, with all tags properly closed (though I am new to XML).

StackTrace:

[Fatal Error] :210:67: XML document structures must start and end within the same entity.
org.xml.sax.SAXParseException; lineNumber: 210; columnNumber: 67; XML document structures must start and end within the same entity.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
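
For context, a parser generally reports this kind of error when the input ends while elements are still open. Below is a minimal sketch of that situation, assuming two hypothetical in-memory strings in place of the real file. The class name TruncatedXmlDemo and the method wellFormednessError are made up for illustration; it does not show that this is what happened to my file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class TruncatedXmlDemo {

    /** Returns null if the text parses, otherwise "line:column: message" from the parser. */
    public static String wellFormednessError(String xml)
            throws ParserConfigurationException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // Fatal well-formedness errors carry the position where parsing stopped.
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data, not the real pdftohtml output.
        String complete = "<pdf2xml><page><text>Low I</text></page></pdf2xml>";
        String truncated = "<pdf2xml><page><text>Low I</text>";

        System.out.println("complete:  " + wellFormednessError(complete));
        System.out.println("truncated: " + wellFormednessError(truncated));
    }
}
```

If the real file behaves like the truncated string, the reported line and column would mark the point where the parser ran out of input, not necessarily a malformed tag on that line.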

Here's the XML content:

    <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.34.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="13" family="Times" color="#4c4c4c"/>
    <fontspec id="1" size="34" family="Times" color="#4c4c4c"/>
    <fontspec id="2" size="28" family="Times" color="#4c4c4c"/>
    <fontspec id="3" size="19" family="Times" color="#4c4c4c"/>
    <fontspec id="4" size="16" family="Times" color="#4c4c4c"/>
<image top="52" left="45" width="828" height="461" src="./app/utils/resume/pdf/DISC-Aditya_Thakur-1_1.jpg"/>
<image top="1009" left="45" width="225" height="96" src="./app/utils/resume/pdf/DISC-Aditya_Thakur-1_2.jpg"/>
<text top="1140" left="45" width="415" height="15" font="0">Copyright 2016 Innermetrix Incorporated • All rights reserved</text>
<text top="416" left="45" width="243" height="36" font="1"><b>Aditya Thakur</b></text>
<text top="457" left="45" width="179" height="30" font="2">May 25, 2016</text>
<text top="551" left="45" width="747" height="21" font="3">This Innermetrix Disc Index is a modern interpretation of Dr. William Marston's</text>
<text top="578" left="45" width="770" height="21" font="3">behavioral dimensions. Marston's research uncovered four quadrants of behavior</text>
<text top="606" left="45" width="809" height="21" font="3">which help to understand a person's behavioral preferences.  This Disc Index will help</text>
<text top="633" left="45" width="703" height="21" font="3">you understand your behavioral style and how to maximize your potential.</text>
<text top="1027" left="293" width="217" height="18" font="4">Anthony Robbins Coaching</text>
<text top="1055" left="293" width="183" height="18" font="4">www.tonyrobbins.com</text>
<text top="1084" left="293" width="5" height="18" font="4"> </text>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
    <fontspec id="5" size="22" family="Times" color="#1c8cc4"/>
    <fontspec id="6" size="22" family="Times" color="#303030"/>
    <fontspec id="7" size="13" family="Times" color="#000000"/>
    <fontspec id="8" size="25" family="Times" color="#4c4c4c"/>
    <fontspec id="9" size="14" family="Times" color="#7f7f7f"/>
    <fontspec id="10" size="40" family="Times" color="#ffffff"/>
    <fontspec id="11" size="16" family="Times" color="#000000"/>
    <fontspec id="12" size="14" family="Times" color="#4c4c4c"/>
    <fontspec id="13" size="14" family="Times" color="#4c4c4c"/>
<image top="18" left="30" width="83" height="83" src="./app/utils/resume/pdf/DISC-Aditya_Thakur-2_1.png"/>
<text top="46" left="128" width="174" height="24" font="5"><b>The DISC Index</b></text>
<text top="46" left="317" width="228" height="24" font="6"><b>Executive Summary</b></text>
<text top="551" left="891" width="0" height="15" font="7">Aditya Thakur</text>
<text top="1140" left="45" width="415" height="15" font="0">Copyright 2016 Innermetrix Incorporated • All rights reserved</text>
<text top="1140" left="865" width="8" height="15" font="7">2</text>
<text top="152" left="196" width="526" height="27" font="8"><b>Natural and Adaptive Styles Comparison</b></text>
<text top="590" left="44" width="26" height="17" font="9">    0</text>
<text top="556" left="44" width="27" height="17" font="9">  10</text>
<text top="521" left="44" width="27" height="17" font="9">  20</text>
<text top="487" left="44" width="27" height="17" font="9">  30</text>
<text top="452" left="44" width="27" height="17" font="9">  40</text>
<text top="418" left="44" width="27" height="17" font="9">  50</text>
<text top="383" left="44" width="27" height="17" font="9">  60</text>
<text top="349" left="44" width="27" height="17" font="9">  70</text>
<text top="314" left="44" width="27" height="17" font="9">  80</text>
<text top="280" left="44" width="27" height="17" font="9">  90</text>
<text top="245" left="44" width="27" height="17" font="9">100</text>
<text top="620" left="156" width="29" height="42" font="10"><b>D</b></text>
<text top="675" left="147" width="56" height="18" font="11">56 / 77</text>
<text top="620" left="359" width="16" height="42" font="10"><b>I</b></text>
<text top="675" left="343" width="56" height="18" font="11">53 / 67</text>
<text top="620" left="552" width="22" height="42" font="10"><b>S</b></text>

//****************Line no 200 starts*****************//
<text top="618" left="320" width="72" height="18" font="11">Inspiring</text>
<text top="652" left="307" width="97" height="18" font="11">Enthusiastic</text>
<text top="686" left="322" width="67" height="18" font="11">Sociable</text>
<text top="721" left="329" width="54" height="18" font="11">Poised</text>
<text top="755" left="316" width="79" height="18" font="11">Charming</text>
<text top="789" left="311" width="89" height="18" font="11">Convincing</text>
<text top="823" left="317" width="78" height="18" font="11">Reflective</text>
<text top="857" left="299" width="112" height="18" font="11">Matter-of-fact</text>
<text top="892" left="311" width="88" height="18" font="11">Withdrawn</text>
<text top="926" left="333" width="46" height="18" font="14"><b>Aloof</b></text>
<text top="999" left="328" width="54" height="21" font="19"><b>Low I</b></text>
<text top="242" left="495" width="134" height="27" font="16"><b>Stabilizing</b></text>
<text top="305" left="536" width="53" height="21" font="17"><b>Pace:</b></text>
<text top="351" left="474" width="176" height="18" font="11">How you tend to pace</text>
<text top="373" left="507" width="111" height="18" font="11">things in your</text>
<text top="394" left="510" width="104" height="18" font="11">environment</text>
<text top="462" left="531" width="63" height="21" font="20"><b>High S</b></text>
<text top="550" left="531" width="63" height="18" font="14"><b>Patient</b></text>
<text top="584" left="517" width="91" height="18" font="11">Predictable</text>
<text top="618" left="533" width="59" height="18" font="11">Passive</text>
<text top="652" left="514" width="97" height="18" font="11">Complacent</text>
//****************Line no 210 ends*****************//
//**********Last 5 line *************************//
<text top="778" left="45" width="792" height="18" font="11">___________________________________________________________________________________________________________</text>
<text top="806" left="45" width="792" height="18" font="11">___________________________________________________________________________________________________________</text>
<text top="835" left="45" width="792" height="18" font="11">___________________________________________________________________________________________________________</text>
</page>
</pdf2xml>

Line 210 is - <text top="999" left="328" width="54" height="21" font="19"><b>Low I</b></text>

Thanks in advance.

Om Prakash
  • The (updated) XML in your post is well-formed, assuming that you remove the leading space from the XML declaration. The error message that you're receiving is complaining about an unclosed delimiter or tag and doesn't match the XML you've posted. Check your assumption that this XML is really the XML that's yielding the error. – kjhughes May 31 '17 at 13:46
  • The XML above is generated automatically from a **PDF** using the `pdftohtml` converter, so I assume all the tags are properly closed and there is no leading space. Please correct me if I have misunderstood. – Om Prakash May 31 '17 at 13:49
  • I used this command to generate the XML from the PDF - `pdftohtml -xml ` – Om Prakash May 31 '17 at 13:54
  • The problem with your code is the use of 'fileName'. You don't know which file is read; you only know the name of the file, not the directory. Instead of passing the fileName to the constructor of FileInputStream, please create a new File with the filename. You could then print the absolutePath of the new File instead of the filename. Think of using a special directory or a class-path resource instead of a file. – Tobias Otto May 31 '17 at 13:54
  • @TobiasOtto, I tested your approach and it is able to find the file, i.e. there is no `FileNotFoundException`. Btw, the code above works for some files and not for others. All the files are in the same directory. Here fileName is passed with its full relative path, e.g. `String fileName = "./app/utils/resume/pdf/sivaadhithya.pdf";` The PDF file is converted and stored in the same directory, i.e. the relative path remains the same and only the file extension changes. Then the `parseXml()` method is invoked to parse the XML. – Om Prakash May 31 '17 at 14:16
  • Sorry, but you did not get my point. I have to presume that your program reads the wrong file, because it uses a relative path. So you do not know which file is parsed. – Tobias Otto May 31 '17 at 14:25
  • I don't like the look of `return new InputSource(new ByteArrayInputStream("".getBytes()));`. What makes you think that getBytes() will give you a UTF-8 encoding? It's probably OK because there are no non-ASCII characters, but do some tests to eliminate this as a cause of problems. – Michael Kay May 31 '17 at 16:22
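
The checks suggested in the comments above can be sketched roughly as follows: resolve the possibly relative name to an absolute path, report the file size, and look at how the file actually ends before handing it to the parser. This is only an illustrative sketch. The class name XmlFileCheck and both method names are hypothetical, and the root element name pdf2xml is taken from the XML posted in the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class XmlFileCheck {

    /** Describes which file a (possibly relative) name resolves to and how it ends. */
    public static String describe(String xmlFileName) throws IOException {
        Path path = Paths.get(xmlFileName).toAbsolutePath().normalize();
        byte[] bytes = Files.readAllBytes(path);
        // Decode with an explicit charset instead of relying on the platform default.
        String text = new String(bytes, StandardCharsets.UTF_8);
        String tail = text.substring(Math.max(0, text.length() - 40)).trim();
        return path + " (" + bytes.length + " bytes), ends with: " + tail;
    }

    /** True if the file's text ends with the closing tag of the given root element. */
    public static boolean endsWithRootClose(String xmlFileName, String rootName)
            throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(xmlFileName));
        String text = new String(bytes, StandardCharsets.UTF_8);
        return text.trim().endsWith("</" + rootName + ">");
    }

    public static void main(String[] args) throws IOException {
        // Pass the same name that is given to parseXml(...).
        String xmlFileName = args[0];
        System.out.println(describe(xmlFileName));
        System.out.println("closed root: " + endsWithRootClose(xmlFileName, "pdf2xml"));
    }
}
```

If the printed path or file ending differs from the file inspected by hand, the program is reading different content than expected. That would be consistent with kjhughes's and Tobias Otto's comments.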
