I have big XML files (between 500 MB and 1 GB) and I'm trying to filter them so that only the nodes with certain attribute values are kept, in this case Prod_id. I have about 10k Prod_id values to filter on, and the XML currently contains about 60k items.

Currently I'm using XSL with node.js (https://github.com/fiduswriter/xslt-processor), but it's really slow (I have never seen a run finish, even after 30-40 minutes).

Is there a way to speed this process up? XSL is not a requirement; I can use anything.

XML Example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<products>
    <Product Quality="approved" Name="WL6A6" Title="BeBikes comfort WL6A6" Prod_id="BBKBECOMFORTWL6A6">
        <CategoryFeatureGroup ID="10030">
            <FeatureGroup>
                <Name Value="Dettagli tecnici" langid="5"/>
            </FeatureGroup>
        </CategoryFeatureGroup>
        <Gallery />
    </Product>
    ...
    <Product Quality="approved" Name="WL6A6" Title="BeBikes comfort WL6A6" Prod_id="LAL733">
        <CategoryFeatureGroup ID="10030">
            <FeatureGroup>
                <Name Value="Dettagli tecnici" langid="5"/>
            </FeatureGroup>
        </CategoryFeatureGroup>
        <Gallery />
    </Product>
</products>

XSL I'm using

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>  
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="
         products/Product
         [not(@Prod_id='CEESPPRIVAIPHONE4')]
         ...
         [not(@Prod_id='LAL733')]"
   />
</xsl:stylesheet>

Thanks

Vincenzo
  • Do you want to do that with node.js? Or any tool/programming language/platform? – Martin Honnen Jan 24 '20 at 15:48
  • Any free tool/language/platform is ok – Vincenzo Jan 24 '20 at 15:58
  • Given that you know the structure and only need to read forwards through the file to decide which `Product` elements to keep or drop, XmlReader or SAX based code might help; for Python, a similar problem is answered in https://stackoverflow.com/a/42411493/252228. XSLT can do it too, but forwards-only (not tree-based) processing is only available in XSLT 3 with streaming, for which you would need Saxon EE (there is a trial license). For normal XSLT 1 or 2 with "free" processors, you could try whether a key speeds things up, although the processor you have chosen doesn't seem to support keys. – Martin Honnen Jan 24 '20 at 17:09
  • Saxon for node.js is not yet available, but hopefully it's a matter of a few weeks now. It won't offer streaming, so you will still need a lot of memory for a document this large. If you need a streaming XSLT processor, you will have to call out to Java, e.g via an HTTP request. – Michael Kay Jan 24 '20 at 17:42
  • The SAX approach suggested by @MartinHonnen would be preferred in your case. – Alejandro Jan 24 '20 at 21:07
  • Thanks, I did it with SAX Parser and Java and it worked perfectly. – Vincenzo Jan 27 '20 at 09:31

1 Answer

I solved it with a SAX filter in Java, using an approach similar to this answer: https://stackoverflow.com/a/13851518/1152049

Thanks

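// Imports needed by the method below (place the method inside any class).
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

private static void filter(InputStream fileInputStream, final Set<String> prodIdToExclude)
        throws SAXException, TransformerException, IOException {
    XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
        // True while the parser is inside a Product that must be dropped.
        private boolean skip;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (qName.equals("Product")) {
                String prodId = atts.getValue("Prod_id");
                if (prodIdToExclude.contains(prodId)) {
                    skip = true;   // drop this Product and everything nested in it
                } else {
                    super.startElement(uri, localName, qName, atts);
                    skip = false;
                }
            } else if (!skip) {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (!skip) {
                super.endElement(uri, localName, qName);
            } else if (qName.equals("Product")) {
                // End of a dropped Product: resume copying, otherwise the closing
                // </products> tag is lost when the last Product is excluded.
                skip = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (!skip) {
                super.characters(ch, start, length);
            }
        }
    };
    Source src = new SAXSource(xr, new InputSource(fileInputStream));
    // try-with-resources closes (and flushes) the output file even on failure.
    try (OutputStream out = new FileOutputStream("output.xml")) {
        Result res = new StreamResult(out);
        // An identity transform serializes the filtered SAX events back to XML.
        TransformerFactory.newInstance().newTransformer().transform(src, res);
    }
}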
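
The question asks to keep only the listed Prod_id values, whereas the method above drops the listed ones. Below is a minimal sketch of the inverted filter, not the code I actually ran. It assumes unprefixed element names as in the sample XML and works on in-memory strings so it can be tried on a small sample first. The names `KeepListedProducts` and `keepOnly` are made up for illustration. It counts nesting depth instead of using a boolean flag, so copying always resumes at the end tag that matches the dropped `Product`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class KeepListedProducts {

    /** Returns a copy of xml that contains only Product elements whose Prod_id is in idsToKeep. */
    static String keepOnly(String xml, Set<String> idsToKeep) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parent = factory.newSAXParser().getXMLReader();

        XMLReader filter = new XMLFilterImpl(parent) {
            // Nesting depth inside a dropped Product; 0 means events pass through.
            private int skipDepth = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                if (skipDepth > 0) {
                    skipDepth++;              // nested element of a dropped Product
                    return;
                }
                if ("Product".equals(qName)) {
                    String id = atts.getValue("Prod_id");
                    // Products without a Prod_id are dropped as well (an assumption).
                    if (id == null || !idsToKeep.contains(id)) {
                        skipDepth = 1;
                        return;
                    }
                }
                super.startElement(uri, localName, qName, atts);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if (skipDepth > 0) {
                    skipDepth--;              // reaches 0 at the dropped Product's end tag
                    return;
                }
                super.endElement(uri, localName, qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                if (skipDepth == 0) {
                    super.characters(ch, start, length);
                }
            }
        };

        StringWriter out = new StringWriter();
        // Identity transform: serializes the filtered SAX events back to XML text.
        TransformerFactory.newInstance().newTransformer()
                .transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                           new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<products>"
                + "<Product Prod_id=\"A1\"><Gallery/></Product>"
                + "<Product Prod_id=\"LAL733\"><Gallery/></Product>"
                + "</products>";
        System.out.println(keepOnly(xml, Set.of("LAL733")));
    }
}
```

For the real 500 MB to 1 GB files, the string input and output would be replaced by file streams as in the method above. A `HashSet` makes each `contains` lookup take roughly constant time, so the 10k IDs are not compared one by one for every product.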
Vincenzo