
I'm trying to open an HTTP connection to a website and parse the HTML into an org.w3c.dom.Document. I can open the connection and print the page to the console just fine, but when I pass the InputStream to the XML parser, it hangs for a minute and then outputs this error:

[Fatal Error] :108:55: Open quote is expected for attribute "{1}" associated with an  element type  "onload".

Code:

private static Document getInputStream(String url) throws IOException, SAXException, ParserConfigurationException
{
  System.out.println(url);
  URL webUrl = new URL(url);
  URLConnection connection = webUrl.openConnection();
  connection.setConnectTimeout(60 * 1000);
  connection.setReadTimeout(60 * 1000);

  InputStream stream = connection.getInputStream();

  DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
  domFactory.setNamespaceAware(true);
  DocumentBuilder builder = domFactory.newDocumentBuilder();
  Document doc = builder.parse(stream); // This line is hanging
  return doc;
}

Stack trace when paused:

Thread [main] (Suspended)   
    SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]    
    SocketInputStream.read(byte[], int, int) line: not available    
    BufferedInputStream.fill() line: not available  
    BufferedInputStream.read1(byte[], int, int) line: not available 
    BufferedInputStream.read(byte[], int, int) line: not available  
    HttpClient.parseHTTPHeader(MessageHeader, ProgressSource, HttpURLConnection) line: not available    
    HttpClient.parseHTTP(MessageHeader, ProgressSource, HttpURLConnection) line: not available  
    HttpURLConnection.getInputStream() line: not available  
    XMLEntityManager.setupCurrentEntity(String, XMLInputSource, boolean, boolean) line: not available   
    XMLEntityManager.startEntity(String, XMLInputSource, boolean, boolean) line: not available  
    XMLEntityManager.startDTDEntity(XMLInputSource) line: not available 
    XMLDTDScannerImpl.setInputSource(XMLInputSource) line: not available    
    XMLDocumentScannerImpl$DTDDriver.dispatch(boolean) line: not available  
    XMLDocumentScannerImpl$DTDDriver.next() line: not available 
    XMLDocumentScannerImpl$PrologDriver.next() line: not available  
    XMLNSDocumentScannerImpl(XMLDocumentScannerImpl).next() line: not available 
    XMLNSDocumentScannerImpl.next() line: not available 
    XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available  
    XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available 
    XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available  
    DOMParser(XMLParser).parse(XMLInputSource) line: not available  
    DOMParser.parse(InputSource) line: not available    
    DocumentBuilderImpl.parse(InputSource) line: not available  
    DocumentBuilderImpl(DocumentBuilder).parse(InputStream) line: not available 
    MSCommunicator.getInputStream(String) line: 45  
    MSCommunicator.getGamePageFromForum(int, int, int) line: 70 
    MSCommunicator.getGamePageFromForum(int, int) line: 57  
    Game.<init>(int, int) line: 21  
    MSCommunicator.main(String[]) line: 26  
ppeterka
Akron

2 Answers

0

You can't expect an XML parser to build a DOM tree from arbitrary HTML, because HTML is not necessarily well-formed XML. You probably need to clean it up first. See the answers to this question:

Reading HTML file to DOM tree using Java

artbristol
0

Even if the HTML page you obtained is proper, valid HTML, it might not be well-formed XML. For example, this is valid in HTML4:

<p class=myclass>Paragraph<br>Next line</p>

In XML (XHTML), the same markup has to be written like this to be well-formed:

<p class="myclass">Paragraph<br/>Next line</p>

Note the self-closed <br/> tag and the quotation marks around the value of the class attribute of the p tag.
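To see the difference in practice, here is a minimal sketch that feeds both snippets to the same kind of DocumentBuilder used in the question. The class name WellFormedCheck and the method isWellFormedXml are made up for this illustration; only the standard JAXP classes are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Hypothetical helper: true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors; installing it replaces the
        // default handler that prints "[Fatal Error] ..." to the console.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // HTML4 style: unquoted attribute value, unclosed <br>
        System.out.println(isWellFormedXml("<p class=myclass>Paragraph<br>Next line</p>"));
        // XHTML style: quoted attribute value, self-closed <br/>
        System.out.println(isWellFormedXml("<p class=\"myclass\">Paragraph<br/>Next line</p>"));
    }
}
```

The first call should be rejected with the same kind of "Open quote is expected" error as in the question, and the second should parse.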

Also, the web is a wild place, and real-world pages are often not even well-formed HTML, so you have to take everything with a grain of salt, well-formedness included. In practice that means running the page through an HTML tidier, such as JTidy or NekoHTML, before building the DOM.
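As for the hang itself: the stack trace in the question shows the parser inside XMLEntityManager.startDTDEntity, blocked in HttpURLConnection.getInputStream. That suggests it is waiting on a download of the page's external DTD, not on the page itself. The sketch below, assuming the JDK's built-in Xerces-based parser, turns off external DTD loading through the Xerces feature URI; other parser implementations may not recognize this feature and can throw ParserConfigurationException. The class name NoExternalDtd and the method parseWithoutDtdFetch are made up for this illustration. It only avoids the DTD fetch and does nothing about markup that is not well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NoExternalDtd {

    // Hypothetical helper: parse without fetching the DTD named in the DOCTYPE.
    static Document parseWithoutDtdFetch(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Xerces-specific feature (assumption: the parser in use supports it).
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
          + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
          + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>Hello</p></body></html>";
        Document doc = parseWithoutDtdFetch(page);
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```

One trade-off: with the DTD skipped, named entities that the DTD would have defined (for example &nbsp;) are no longer known to the parser and will cause a parse error if the page uses them.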

ppeterka