I have a text document in which each line is an entire US patent XML document. I am trying to parse it to remove certain features like the patent number, etc. I haven't used XPath before, so I'm borrowing some code I found from Ravi Thapliyal at Parse XML Simple String using Java XPath. However, apparently the initial !DOCTYPE tag is causing the DocumentBuilder to try to find the actual document somewhere?
Here is my first attempt at code:
//convert entire file to ArrayList of strings
ArrayList<String> doc = new ArrayList<>();
while(input.hasNext()){
doc.add(input.nextLine().trim());
}
int index = 0;
while(index < doc.size()){
String xml = doc.get(index);
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xPath = xpathFactory.newXPath();
InputSource source = new InputSource(new StringReader(xml));
db.setEntityResolver(new EntityResolver() {
public InputSource resolveEntity(String publicId, String systemId)
throws SAXException, java.io.IOException {
if (systemId.contains("us-patent-grant-v40-2004-12-02.dtd")) {
return new InputSource(new StringReader(""));
} else {
return null;
}
}
});
String orgName = "";
try {
orgName = (String) xPath.evaluate("/agents/adressbook/orgname", source,XPathConstants.STRING);
} catch (Exception e) {
e.printStackTrace();
}
System.out.println("Document #" + index + " Company: " + orgName);
}//end while loop that goes through each line (patent document) in file
The beginning of each line in the input file contains the following after the DOCTYPE declaration: us-patent-grant SYSTEM "us-patent-grant-v40-2004-12-02.dtd" [ ]>
The line that causes the problem (91) is:
orgName = (String) xPath.evaluate("/agents/adressbook/orgname",
source,XPathConstants.STRING);
And the stacktrace is:
java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:131)
at java.io.FileInputStream.<init>(FileInputStream.java:87)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616)
Document #0 Company:
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260)
at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466)
at Parser.main(Parser.java:102)
--------------- linked to ------------------
javax.xml.xpath.XPathExpressionException: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:473)
at Parser.main(Parser.java:102)
Caused by: java.io.FileNotFoundException: C:\Users\Dave\Documents\NetBeansProjects\ParseXML\us-patent-grant-v40-2004-12-02.dtd (The system cannot find the file specified)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:131)
at java.io.FileInputStream.<init>(FileInputStream.java:87)
at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:616)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1293)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1260)
at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:263)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1164)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1050)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:938)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:348)
at com.sun.org.apache.xpath.internal.jaxp.XPathImpl.evaluate(XPathImpl.java:466)
Can someone help me figure out what I should be doing to parse a document in a string?