
I am trying to transform an XML document using XSLT. As input I have the XHTML source of www.wordpress.org, and the XSLT is a dummy example that retrieves the site's title (it could just as well do nothing; that changes nothing about the timing).

With every API or library I have tried, the transformation takes about 2 minutes! If you look at the wordpress.org source, you will notice that it is only 183 lines long. From what I have googled, the cause is probably DOM tree building. No matter how simple the XSLT is, it always takes 2 minutes, which supports the idea that it is related to DOM building, but in my opinion it still should not take 2 minutes.

Here is an example code (nothing special):

   TransformerFactory tFactory = TransformerFactory.newInstance();
   Transformer transformer = null;

   try {
       transformer = tFactory.newTransformer(
           new StreamSource("/home/pd/XSLT/transf.xslt"));

   } catch (TransformerConfigurationException e) {
       e.printStackTrace();
   }

   ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

   System.out.println("START");
   try {
       transformer.transform(new SAXSource(new InputSource(
           new FileInputStream("/home/pd/XSLT/wordpress.xml"))),
           new StreamResult(outputStream));
   } catch (TransformerException e) {       
       e.printStackTrace();
   } catch (IOException e) {
       e.printStackTrace();
   }
   System.out.println("STOP");

   System.out.println(new String(outputStream.toByteArray()));

It is between START and STOP that Java "pauses" for 2 minutes. If I watch the processor or memory usage, nothing increases. It really looks as if the JVM has stopped...

Do you have any experience transforming XML documents longer than 50 lines (a random number ;))? As I understand it, XSLT always needs to build a DOM tree in order to do its work. Fast transformation is crucial for me.

Thanks in advance, Piotr

omnomnom
  • How large is `wordpress.xml`? – David Weiser Jan 25 '11 at 21:40
  • It is the www.wordpress.org XHTML - it is 183 lines long (already wrapped) – omnomnom Jan 25 '11 at 21:41
  • Could you post the XML and the XSLT files? Something else: if your output is a ByteArrayOutputStream, that might cause you problems, since normally your input will be UTF-8, but not your output. – luiscolorado Jan 25 '11 at 21:49
  • http://www.copypastecode.com/62601/ here is input XML (www.wordpress.org website source) – omnomnom Jan 25 '11 at 21:54
  • The problem is in your xslt.. post that and someone may be able to help you. – Spaceghost Jan 25 '11 at 22:36
  • @Piotrek De: I think this has nothing to do with the transformation but with the XML parser: it's trying to retrieve the DTD. So, you have to configure the parser to stop this behavior (my choice), or you have to trick the parser with a special URI resolver as in [this question](http://stackoverflow.com/questions/1572808/java-xml-xslt-prevent-dtd-validation), or you have to change the "SYSTEM" URI of the DOCTYPE declaration in the input source into a relative local URI (and keep a local copy of the DTD, of course). –  Jan 25 '11 at 23:25

4 Answers


Does the sample HTML file use namespaces or declare a DOCTYPE? If so, your XML parser may be attempting to retrieve external content (a DTD or a schema, perhaps) over the network. That is plausible if each run takes almost exactly two minutes: it looks like one or more TCP timeouts.

You can verify this by timing the parse step separately from the transform. Note that constructing the InputSource does not read anything by itself; the WordPress XML is parsed when the transformer (or a DocumentBuilder) consumes it, and that is the likely source of the delay. After reviewing the sample file you posted, it does include a declared namespace (xmlns="http://www.w3.org/1999/xhtml").
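
To see what the parser actually asks for, a small probe can help. This is only a sketch: the class name ExternalEntityProbe and its method are made up for illustration, and the in-memory document is a stand-in for the real file. It records every system ID passed to the resolver and times the parse, while handing back an empty entity so that nothing is downloaded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class ExternalEntityProbe {

    /** Parses the XML and returns every system ID the parser asked to resolve. */
    static List<String> requestedSystemIds(String xml) throws Exception {
        List<String> requested = new ArrayList<>();
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        db.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Hand back an empty entity so nothing is fetched over the network.
            return new InputSource(new StringReader(""));
        });

        long start = System.nanoTime();
        db.parse(new InputSource(new StringReader(xml)));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Parse took " + elapsedMs + " ms");
        return requested;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document with an XHTML DOCTYPE and the XHTML namespace.
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body/></html>";
        System.out.println("Requested: " + requestedSystemIds(xml));
    }
}
```

If the list contains a DTD URL, that request is what the parser would otherwise make over the network.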

To work around this, you can implement your own EntityResolver which essentially disables the URL-based resolution. You may need to use a DOM -- see DocumentBuilder's setEntityResolver method.

Here's a sample using DOM and disabling resolution (note -- this is untested):

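// Imports this snippet needs:
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

try {
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    dbFactory.setNamespaceAware(true); // XSLT expects a namespace-aware DOM
    DocumentBuilder db = dbFactory.newDocumentBuilder();
    db.setEntityResolver(new EntityResolver() {

        @Override
        public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
            // Return an empty entity so nothing is fetched. Returning null would
            // tell the parser to fall back to its default (network) resolution.
            return new InputSource(new StringReader(""));
        }
    });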

    System.out.println("BUILDING DOM");

    Document doc = db.parse(new FileInputStream("/home/pd/XSLT/wordpress.xml"));

    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer(
        new StreamSource("/home/pd/XSLT/transf.xslt"));

    System.out.println("RUNNING TRANSFORM");

    transformer.transform(
            new DOMSource(doc.getDocumentElement()),
            new StreamResult(outputStream));

    System.out.println("TRANSFORMED CONTENTS BELOW");
    System.out.println(outputStream.toString());
} catch (Exception e) {
    e.printStackTrace();
}

If you want to use SAX, you would have to use a SAXSource with an XMLReader which uses your custom resolver.
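
For the SAX route, a minimal sketch might look like the following. The class name SaxNoDtdTransform, the in-memory document, and the stylesheet are placeholders invented for this example; the idea is only that the XMLReader carrying the resolver is handed to the transformer through the SAXSource.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of any library.
final class SaxNoDtdTransform {

    static String transform(InputSource xml, StreamSource xslt) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // XSLT needs namespace-aware parsing
        XMLReader reader = spf.newSAXParser().getXMLReader();
        // Answer every external entity request with an empty entity.
        reader.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));

        Transformer transformer = TransformerFactory.newInstance().newTransformer(xslt);
        StringWriter out = new StringWriter();
        // The SAXSource tells the transformer to parse with our reader.
        transformer.transform(new SAXSource(reader, xml), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Hello</title></head><body/></html>";
        String xslt = "<xsl:stylesheet version=\"1.0\" "
                + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" "
                + "xmlns:x=\"http://www.w3.org/1999/xhtml\">"
                + "<xsl:output method=\"text\"/>"
                + "<xsl:template match=\"/\">"
                + "<xsl:value-of select=\"/x:html/x:head/x:title\"/>"
                + "</xsl:template></xsl:stylesheet>";
        System.out.println(transform(new InputSource(new StringReader(xml)),
                new StreamSource(new StringReader(xslt))));
    }
}
```

One caveat with an empty DTD: named entities that are only defined in the DTD (for example &nbsp; in XHTML) are no longer declared, so a document that uses them will fail to parse.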

Phil M
  • Indeed, some XML parsers do that to validate the document. You can turn it off by configuration. However, that's XML parser specific. OP has to mention what JAXP implementation he is using (e.g. Xerces, Saxon, Crimson, etc), then it can be consulted in its documentation. – BalusC Jan 25 '11 at 21:54
  • Thanks! You are right, it is all about downloading the DTD and all the other files this DTD includes. Both EntityResolver solutions work nicely, but there is one limitation: I need to know which DTD will be needed, prepare a cached copy, and make it available in the EntityResolver. What if I don't know the input XML in advance (it is supplied by the client at runtime)? Then I don't know which DTD will be needed. Is there any way to "hijack" the DTD downloaded by the transformer (assume that my entity resolver returned null), cache it, and return it from the cache the next time it is needed? – omnomnom Jan 26 '11 at 20:09
  • Before you go down the road of caching/pre-loading DTDs in a custom resolver, you should determine if the DTDs are even required for the problem you're trying to solve. Since you're trying to apply XSLTs to XHTML, I would guess not (note that the DTDs are being resolved during the XML parsing phase, not the transform phase). – Phil M Jan 27 '11 at 19:53
  • 2
    better (for xerces/xalan at least) than return null may be return new InputSource(new StringReader("")). And/or factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false); – JasonPlutext Aug 18 '13 at 00:53
  • Thanks JasonPlutext return null; was taking along time still so I did return new InputSource(new StringReader("")); and it is fast now. – PHPGuru Dec 22 '14 at 18:46
  • Also note that you may be able to enclose a local copy of the XSD's/DTD's with your application and tell the parser about it so the local copies are used. – Thorbjørn Ravn Andersen Sep 14 '15 at 19:33
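
The feature flag from the comment above could be tried roughly as follows. This is a sketch that assumes a Xerces-based parser (the JDK's built-in one is); other JAXP implementations may reject the feature with a ParserConfigurationException. The class name NoExternalDtdParse and the sample document are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class NoExternalDtdParse {

    static Document parseWithoutExternalDtd(InputSource xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // Xerces-specific: do not load the external DTD when not validating.
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return dbf.newDocumentBuilder().parse(xml);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        Document doc = parseWithoutExternalDtd(new InputSource(new StringReader(xml)));
        System.out.println("Root element: " + doc.getDocumentElement().getLocalName());
    }
}
```

The resulting Document can then be passed to the transformer as a DOMSource, as in the answer above.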

The commenters who've posted that the answer likely resides with the EntityResolver are probably correct. However, the solution may be not to skip loading the schemas altogether but to load them from the local file system.

So you could do something like this:

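  db.setEntityResolver(new EntityResolver() {

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
        // systemId is usually a full URI, so keep only its last path segment as the local name
        String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
        // Look for a bundled copy under xsd/ on the classpath
        // (java.io.File does not understand a "classpath:" prefix)
        InputStream in = getClass().getClassLoader().getResourceAsStream("xsd/" + fileName);
        if (in == null) {
            // logger is assumed to be a logger defined in the enclosing class
            logger.error("No local copy found for " + systemId);
            return null; // fall back to the parser's default resolution
        }
        InputSource is = new InputSource(in);
        is.setSystemId(systemId); // keep the original URI as the base for relative references
        return is;
    }
});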
Karthik Ramachandran
  • Thanks! You are right, it is all about downloading the DTD and all the other files this DTD includes. Both EntityResolver solutions work nicely, but there is one limitation: I need to know which DTD will be needed, prepare a cached copy, and make it available in the EntityResolver. What if I don't know the input XML in advance (it is supplied by the client at runtime)? Then I don't know which DTD will be needed. Is there any way to "hijack" the DTD downloaded by the transformer (assume that my entity resolver returned null), cache it, and return it from the cache the next time it is needed? – omnomnom Jan 26 '11 at 20:10
  • This should be relatively straightforward. First check to see if a file with the systemId of the DTD exists (the systemID is the filename itself.) If the file doesn't exist locally then use the publicId as a URL, fetch the DTD, and write it out to the local file system. Return an input source that uses the file you downloaded. Let me know if you need code. – Karthik Ramachandran Jan 26 '11 at 22:42
  • Thanks for the answer, but unfortunately it's (probably) not so easy. Let's take http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd as an example. It includes three other files which should be accessible through the entity resolver: xhtml-lat1.ent, xhtml-symbol.ent and xhtml-special.ent. When I print out the system and public IDs for the included ENT files, I get: systemId: file:///home/pd/workspace/XMLTransf/xhtml-lat1.ent publicId: -//W3C//ENTITIES Latin 1 for XHTML//EN (... the others in the same way ...) so I can use neither the public nor the system ID to retrieve it ;/ ("workspace/XMLTrans" is my Eclipse project dir) – omnomnom Jan 27 '11 at 07:54
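
A download-once cache along the lines discussed in these comments might be sketched like this. Everything here is an assumption made for illustration: the class CachingEntityResolver, its Fetcher hook, and the cache layout are invented, and the sketch treats the system ID (not the public ID) as the URL to download. Setting the original system ID on the returned InputSource should let relative references inside the DTD, such as xhtml-lat1.ent, resolve against the DTD's own URL rather than the working directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class CachingEntityResolver implements EntityResolver {

    /** Hypothetical hook so the download step can be swapped out (e.g. in tests). */
    interface Fetcher {
        InputStream open(String systemId) throws IOException;
    }

    private final Path cacheDir;
    private final Fetcher fetcher;

    CachingEntityResolver(Path cacheDir, Fetcher fetcher) {
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    /** Default setup: treat the system ID as a URL and download from it. */
    static CachingEntityResolver overNetwork(Path cacheDir) {
        return new CachingEntityResolver(cacheDir,
                systemId -> URI.create(systemId).toURL().openStream());
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Files.createDirectories(cacheDir);
        // Encode the whole system ID so different URLs never share a cache file.
        Path cached = cacheDir.resolve(URLEncoder.encode(systemId, "UTF-8"));
        if (!Files.exists(cached)) {
            // Download to a temp file first so a failed fetch leaves no partial entry.
            Path tmp = Files.createTempFile(cacheDir, "download", ".tmp");
            try (InputStream in = fetcher.open(systemId)) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            Files.move(tmp, cached, StandardCopyOption.REPLACE_EXISTING);
        }
        InputSource source = new InputSource(Files.newInputStream(cached));
        // Keep the original system ID as the base URI for relative references.
        source.setSystemId(systemId);
        return source;
    }
}
```

It would be wired in with something like db.setEntityResolver(CachingEntityResolver.overNetwork(Paths.get("dtd-cache"))), where the directory name is arbitrary. The first document using a given DTD still pays for the download; later ones read the local copy.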

Chances are the problem isn't with the call to transformer.transform. It's more likely that you are doing something in your XSLT that is taking forever. My suggestion would be to use a tool like Oxygen or XMLSpy to profile your XSLT and find out which templates take the longest to execute. Once you've determined this, you can begin to optimize the template.

Karthik Ramachandran
  • Absolutely - the problem is in the XSLT code. The fact that the poster chose to show us the Java code and not the XSLT code suggests they don't really know where to start investigating this. – Michael Kay Jan 26 '11 at 08:58

If you are debugging your code on an Android device, make sure you try it without Eclipse attached to the process. When I was debugging my app, XSLT transformations were taking 8 seconds, whereas the same process took a tenth of a second on iOS in native code. Once I ran the code without Eclipse attached to it, the process took a comparable amount of time to its C-based counterpart.

user1532390