
I am trying to parse an input XML file that is 1,300,000 lines long (56 MB) using xsltproc. I get the following error:

input.xml:245393: parser error : internal error: Huge input lookup
              "description" : "List of values for possible department codes"
                          ^
unable to parse input.xml

My xsltproc was able to process an XML file that was 930,000 lines long (48 MB).

In fact, I tried decreasing the xml lines to 600,000 by removing the unnecessary parts. Still, same error, which is strange, because it is able to parse 900,000 but not 600,000.

How do I resolve this issue?

AutoTester999
  • There is a lookup limit defined in https://gitlab.gnome.org/GNOME/libxml2/blob/master/include/libxml/parserInternals.h#L66 but `maxLength` as `30` sounds more like an XSD schema related problem. Is that document referring to a schema? Is the error occurring during some `xsl:key` processing? – Martin Honnen Dec 13 '19 at 06:19
  • "maxLength:30" can be ignored. It's just a string in my input xml. Is there a way I can increase the XML_MAX_LOOKUP_LIMIT? I tried decreasing the xml lines to 600,000. Still, same error, which is strange, because it is able to parse 900,000 but not 600,000. – AutoTester999 Dec 13 '19 at 07:06
  • edited question to avoid confusion – AutoTester999 Dec 13 '19 at 07:11
  • I am no expert on xsltproc, but its help lists three options you can set: `--maxdepth val : increase the maximum depth (default 3000), --maxvars val : increase the maximum variables (default 15000), --maxparserdepth val : increase the maximum parser depth`. Perhaps play with changing them to see whether you get a different result. It might help to set a tag on your question for that processor; hopefully someone shows up who can better interpret the error message and tell you how to avoid it. – Martin Honnen Dec 13 '19 at 07:55
  • 48Mb is not a huge document. "Huge" these days is more like 48Gb. – Michael Kay Dec 13 '19 at 09:12
  • I gave the --maxdepth, --maxvars and --maxparserdepth as 3000000. Still the same error. – AutoTester999 Dec 15 '19 at 15:42
  • I would first try another XML parser or XSLT processor, to check whether the problem is with xsltproc. Part of the libxml toolbox is xmllint, which parses the file without XSLT processing, so you could run the file through `xmllint` as well, to check whether libxml's parser alone can handle it or gives you the same error. – Martin Honnen Dec 15 '19 at 21:30
  • https://stackoverflow.com/a/32115337/252228 suggests you can edit the source of libxml2 to set the `XML_PARSE_HUGE` parser option (which then I think disables any security based restrictions/limits normally set by default). Then you need to recompile. Or try to use one of the languages like Python or PHP which use libxml2, it seems they have options (e.g. lxml in https://lxml.de/parsing.html#parser-options declares `huge_tree`) to disable the security based limits at run-time. – Martin Honnen Dec 15 '19 at 21:44
  • Martin is right. Almost every time this error arises with libxml, setting the PARSE_HUGE flag solves it. – Alejandro Dec 17 '19 at 13:38
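
To follow the suggestion above of trying another XML parser first, here is a minimal sketch that streams the file through Python's standard-library `xml.etree.ElementTree`, which is built on expat rather than libxml2. The function name `count_elements` is only illustrative and not part of any tool mentioned here. If this succeeds on the same file, the document is probably well-formed and the error more likely comes from libxml2's default limits than from the input.

```python
import xml.etree.ElementTree as ET


def count_elements(path):
    """Stream-parse `path` with expat and return the number of elements seen.

    Raises xml.etree.ElementTree.ParseError if the file is not well-formed.
    """
    count = 0
    # iterparse yields each element once its end tag has been read,
    # so the whole document is never held as one fully populated tree.
    for _event, elem in ET.iterparse(path, events=("end",)):
        count += 1
        elem.clear()  # drop the element's text and children once counted
    return count
```

Calling `count_elements("input.xml")` either returns a count or raises `ParseError` with the line and column of the first problem.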

3 Answers


Write your own xsltproc in Python based on this snippet:

import argparse

from lxml import etree

parser = argparse.ArgumentParser()
parser.add_argument('stylesheet', help='XSLT style sheet', type=argparse.FileType('r', encoding='utf-8'))
parser.add_argument('input', help='XML input file(s)', nargs='*', type=argparse.FileType('r', encoding='utf-8'))
# --output is required: write_output() below needs a file to write to.
parser.add_argument('--output', help='The output file to create.', type=argparse.FileType('wb'), required=True)

args = parser.parse_args()

# Compile the style sheet once and reuse it for every input file.
transform = etree.XSLT(etree.parse(args.stylesheet))

# huge_tree=True lifts libxml2's default security limits on document size.
xml_parser = etree.XMLParser(huge_tree=True)

for xml in args.input:
    # Results for all inputs are written to the same output file, in order.
    transform(etree.parse(xml, xml_parser)).write_output(args.output)

This uses lxml as suggested in this answer.

The huge_tree=True argument sets the corresponding parser option in libxml2 and thus enables it to process large files. See Parser options for more information.

Adrian W

libxslt 1.1.35 added a --huge option to xsltproc which disables some internal limits like XML_MAX_LOOKUP_LIMIT.
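
As a rough sketch of how that option could be used from a script, the helper below builds and runs an `xsltproc --huge` command line with Python's `subprocess` module. The function names and any file names passed to them are hypothetical, and this assumes the installed xsltproc is from libxslt 1.1.35 or later; older versions will not recognise `--huge`.

```python
import subprocess


def xsltproc_huge_command(stylesheet, xml_input, output=None):
    """Build an xsltproc command line that passes --huge."""
    cmd = ["xsltproc", "--huge"]
    if output is not None:
        # -o tells xsltproc to write the result to a file instead of stdout
        cmd += ["-o", output]
    cmd += [stylesheet, xml_input]
    return cmd


def run_xsltproc_huge(stylesheet, xml_input, output=None):
    """Run xsltproc with --huge; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        xsltproc_huge_command(stylesheet, xml_input, output),
        check=True,
    )
```

For example, `run_xsltproc_huge("style.xsl", "input.xml", "out.xml")` is equivalent to running `xsltproc --huge -o out.xml style.xsl input.xml` in a shell.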

nwellnhof

Using Oxygen XML Editor (Xalan) resolved my issue.

AutoTester999
  • Not affiliated with and no intent to recommend any specific commercial product, just wanted to note that Altova's XmlSpy works too. You might also want to try [my solution](https://stackoverflow.com/a/70896959/2311167) which is completely free. – Adrian W Jan 28 '22 at 16:30