I have a text file in which each line is an entire US patent XML document. I am trying to parse each one to pull out certain fields, like the patent number. I haven't used XPath before, so I'm borrowing some code from Ravi Thapliyal's answer at Parse XML Simple String using Java XPath. However, the initial !DOCTYPE declaration apparently causes the parser to go looking for the referenced DTD file on disk.

Here is my first attempt at code:

// input is the Scanner over the file and db is a DocumentBuilder; both are created earlier (not shown)

// convert entire file to an ArrayList of strings, one patent document per line
ArrayList<String> doc = new ArrayList<>();
while (input.hasNextLine()) {
    doc.add(input.nextLine().trim());
}

int index = 0;
while (index < doc.size()) {
    String xml = doc.get(index);
    XPathFactory xpathFactory = XPathFactory.newInstance();
    XPath xPath = xpathFactory.newXPath();
    InputSource source = new InputSource(new StringReader(xml));

    // try to answer requests for the patent DTD with an empty document
    db.setEntityResolver(new EntityResolver() {
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, java.io.IOException {
            if (systemId.contains("us-patent-grant-v40-2004-12-02.dtd")) {
                return new InputSource(new StringReader(""));
            } else {
                return null;
            }
        }
    });

    String orgName = "";
    try {
        orgName = (String) xPath.evaluate("/agents/adressbook/orgname", source, XPathConstants.STRING);
    } catch (Exception e) {
        e.printStackTrace();
    }

    System.out.println("Document #" + index + " Company: " + orgName);
    index++;
} // end while loop that goes through each line (patent document) in file

Near the beginning of each line in the input file is the following DOCTYPE declaration: <!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v40-2004-12-02.dtd" [ ]>

The line that causes the problem (91) is:

orgName = (String) xPath.evaluate("/agents/adressbook/orgname", 
       source,XPathConstants.STRING);

And the stacktrace is:

java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:131)
    at java.io.FileInputStream.<init>(FileInputStream.java:87)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616)
Document #0 Company: 
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466)
    at Parser.main(Parser.java:102)
--------------- linked to ------------------
javax.xml.xpath.XPathExpressionException: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:473)
    at Parser.main(Parser.java:102)
Caused by: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:131)
    at java.io.FileInputStream.<init>(FileInputStream.java:87)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
    at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466)

Can someone help me figure out what I should be doing to parse a document in a string?

David Farthing

2 Answers

Try setting parser features that disable DTD loading, or supply an empty EntityResolver.

The features are implementation-specific, so you first need to find out which parser implementation you are using.

See: Make DocumentBuilder.parse ignore DTD references
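A minimal sketch of both suggestions combined, under some assumptions. The stack trace in the question shows that `xPath.evaluate(expression, InputSource, ...)` parses the source with a parser it creates itself, so a resolver set on a separate `DocumentBuilder` is never consulted. Parsing the string with your own builder first and then evaluating the XPath against the resulting `Document` avoids that. The class name `DtdFreeXPath`, the simplified sample document, and the `//orgname` expression are made up for illustration; the feature URI is specific to Xerces (including the copy bundled with the JDK), so other parsers may reject it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question.
public class DtdFreeXPath {

    /** Parses one XML string without fetching its external DTD, then evaluates an XPath expression. */
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            // Xerces-specific feature: do not load the external DTD when not validating.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        } catch (ParserConfigurationException e) {
            // Another parser implementation: fall back to the empty resolver below.
        }

        DocumentBuilder builder = factory.newDocumentBuilder();
        // Answer every external entity request with an empty document.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));

        // Parse with OUR builder, so the settings above actually apply.
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        return (String) xPath.evaluate(expression, document, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        // Simplified, made-up document; real patent XML has a deeper structure.
        String xml = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE us-patent-grant SYSTEM \"us-patent-grant-v40-2004-12-02.dtd\" [ ]>"
            + "<us-patent-grant><agents><addressbook>"
            + "<orgname>Example Corp</orgname>"
            + "</addressbook></agents></us-patent-grant>";
        System.out.println(evaluate(xml, "//orgname"));
    }
}
```

Note that an absolute path such as `/agents/...` only matches when `agents` is the root element, which is why the sketch uses `//orgname` instead.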

Vovka


Have you tried supplying the DTD file it's trying to reference, e.g. download it from us-patent-application-v40-2004-12-02.dtd?

You can try putting this file in the same folder as the XML; or in the current directory of the parsing process (try both since you're in a hurry).
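If you would rather not depend on the current directory, one option is to point the parser at your local copy of the DTD explicitly. The following is only a sketch: the class name `LocalDtdResolver`, the `dtdDirectory` parameter, and the tiny stand-in DTD written in `main` are assumptions for illustration, not the real patent DTD.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
public class LocalDtdResolver {

    /** Parses an XML string, serving any referenced DTD from dtdDirectory when a copy exists there. */
    static Document parseWithLocalDtd(String xml, Path dtdDirectory) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            if (systemId == null) {
                return null; // nothing to map; use the parser's default behaviour
            }
            // Keep only the file name, whatever base URI the parser prepended.
            String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
            Path local = dtdDirectory.resolve(fileName);
            if (Files.isRegularFile(local)) {
                InputSource source = new InputSource(Files.newInputStream(local));
                source.setSystemId(local.toUri().toString());
                return source;
            }
            return null; // no local copy; the parser falls back to its default lookup
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in DTD in a temporary folder; in practice this would be the downloaded file.
        Path dir = Files.createTempDirectory("dtds");
        Files.write(dir.resolve("us-patent-grant-v40-2004-12-02.dtd"),
            "<!ELEMENT us-patent-grant ANY>".getBytes(StandardCharsets.UTF_8));

        String xml = "<!DOCTYPE us-patent-grant SYSTEM \"us-patent-grant-v40-2004-12-02.dtd\" [ ]>"
            + "<us-patent-grant/>";
        Document document = parseWithLocalDtd(xml, dir);
        System.out.println(document.getDocumentElement().getNodeName());
    }
}
```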

LarsH