
When I run the code below, I get:

[Fatal Error] :1:1: Content is not allowed in prolog.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

I know the string html contains content that is not allowed, but I would like to suppress all errors.

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.w3c.dom.*;
import org.xml.sax.InputSource;

import javax.xml.xpath.*;
import javax.xml.parsers.*;
public class Test {

    public static void main(String[] args){
        String html="---<html><div id='teste'>Teste</div><div id='ola'>Ola tudo ebm!</div></html>";

        try{

            XPath xpath = XPathFactory.newInstance().newXPath();
            String xpathExpression = "//div[@id='ola']";

            InputStream is = new ByteArrayInputStream(html.getBytes()); 
            InputSource inputSource = new InputSource(is);

            NodeList nodes = (NodeList) xpath.evaluate
            (xpathExpression, inputSource, XPathConstants.NODESET);

            int j = nodes.getLength();

            for (int i = 0; i < j; i++) {
                System.out.println(nodes.item(i).getTextContent());
            }

        } catch (Exception e) {
            e.printStackTrace();
        }

    }
}
nhahtdh
adrianogf

3 Answers


Your best bet is to create your own InputStream that wraps the ByteArrayInputStream and sanitises the data before it reaches xpath.evaluate.
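A minimal sketch of such a wrapper, under narrow assumptions: the only problem is junk before the first '<' (as with the "---" in the question), and the input uses an ASCII-compatible encoding. The names PrologSkippingInputStream, PrologSkippingDemo and firstMatch are hypothetical and chosen for illustration; markup that is broken in other ways would still be rejected.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

class PrologSkippingDemo {

    /** Hypothetical wrapper: drops every byte that precedes the first '<'. */
    static class PrologSkippingInputStream extends FilterInputStream {
        private boolean prologSkipped = false;

        PrologSkippingInputStream(InputStream in) {
            // One byte of pushback is enough to hand back the '<' we had to read.
            super(new PushbackInputStream(in, 1));
        }

        // Runs once, before the first byte is handed to the caller.
        private void skipProlog() throws IOException {
            if (prologSkipped) {
                return;
            }
            prologSkipped = true;
            PushbackInputStream pushback = (PushbackInputStream) in;
            int b;
            while ((b = pushback.read()) != -1) {
                if (b == '<') {
                    pushback.unread(b);
                    return;
                }
            }
        }

        @Override
        public int read() throws IOException {
            skipProlog();
            return super.read();
        }

        @Override
        public int read(byte[] buffer, int offset, int length) throws IOException {
            skipProlog();
            return super.read(buffer, offset, length);
        }
    }

    /** Returns the string value of the first node matching the expression. */
    static String firstMatch(String text, String xpathExpression) {
        InputStream raw = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        InputSource source = new InputSource(new PrologSkippingInputStream(raw));
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(xpathExpression, source);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "---<html><div id='teste'>Teste</div><div id='ola'>Ola tudo ebm!</div></html>";
        System.out.println(firstMatch(html, "//div[@id='ola']"));
    }
}
```

With the string from the question, the leading "---" never reaches the parser, so the XPath expression is evaluated against a well-formed document.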

thedayofcondor

First, XML is not the same as HTML, and XPath works on the XML data model.

To solve this, you will have to find some other way of parsing your input stream. When you parse that string, the parser that is invoked is an XML parser, and XML parsers by definition have no "ignore errors" option. Only well-formed input is accepted; the XML specification itself says that ill-formed input must cause a fatal error.
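To illustrate the point, here is a small sketch using the JDK's DocumentBuilder. It installs a custom ErrorHandler that rethrows without printing; as far as I can tell, the "[Fatal Error]" line in the question is written by the parser's default handler, so this silences the message, but the parse still fails. The class and method names (StrictParserDemo, QuietErrorHandler, isWellFormed) are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class StrictParserDemo {

    /** Rethrows errors without printing them, so nothing is written to stderr. */
    static class QuietErrorHandler implements ErrorHandler {
        @Override
        public void warning(SAXParseException e) {
            // Warnings are ignored in this sketch.
        }

        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e;
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            throw e;
        }
    }

    /** Returns true only if the text parses as well-formed XML. */
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new QuietErrorHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // The parser refused the input; there is no partial document to query.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("---<html><div id='ola'>Ola</div></html>")); // false
        System.out.println(isWellFormed("<html><div id='ola'>Ola</div></html>"));    // true
    }
}
```

The handler only controls how the error is reported. It cannot make the parser continue and return a usable document.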

So an alternative would be to use a different parser. There are several out there. For example, you could use JTidy. Although it parses HTML into an HTML DOM, with a little bit of glue code you can convert that so it is suitable for querying with XPath. See Question 3361263, Library to query HTML with XPath in Java.

lavinio

I tried manipulating your HTML and everything works for me. I can confirm that I also got a null value when I tried to evaluate the XPath, but this is how I bypassed it :)

    // Needs: java.io.File, java.io.IOException, javax.xml.parsers.*,
    // org.w3c.dom.Document, org.w3c.dom.Node, org.w3c.dom.NodeList,
    // org.xml.sax.SAXException
    try {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new File("D:\\Loic_Workspace\\Test2\\res\\test.xml"));

        // Look the <div> elements up once instead of on every iteration
        NodeList divs = doc.getElementsByTagName("div");

        for (int i = 0; i < divs.getLength(); i++) {
            Node div = divs.item(i);
            // item(0) is the first attribute, which is 'id' for the divs in this document
            if (div.getAttributes().item(0).getTextContent().equals("ola")) {
                System.out.println(div.getTextContent());
            }
        }
    } catch (IOException | SAXException | ParserConfigurationException e) {
        e.printStackTrace();
    }

Output in the console: Ola tudo ebm!

Calling getAttributes().item(0) on each div node gives the reference to the 'id' attribute in the document. I retrieve the text content of that attribute with the getTextContent() method.

I know that it's not the most efficient method but it works :)

Hope it helps,