
I'm new to Java and would like the community's opinion. I have a huge XML file, approximately 140 MB. Much of the information in it is no longer valid, so I need to filter it and keep only the valid data. To decide whether a node must be deleted, I need to cross-check information between nodes. In some cases, the entire parent (main) node needs to be deleted.

I'm already doing this with a DOM parser: I loop over the nodes, save values in variables inside the loops, cross-check them, and then delete either the current node or its entire parent node.

Basically, the structure is like this:

<source>
    <main>
        <id>98567</id>
        <block_information>
            <name>Block A</name>
            <start_date>20120210</start_date>
            <end_date>20150210</end_date>
        </block_information>
        <block_information>
            <name>Block A.01</name>
            <start_date>20150210</start_date>
            <end_date>20251005</end_date>
        </block_information>
        <city_information>
            <name>Manchester</name>
            <start_date>20150210</start_date>
            <end_date>20150212</end_date>
        </city_information>
        <city_information>
            <name>New Manchester</name>
            <start_date>20150212</start_date>
            <end_date>20251005</end_date>
        </city_information>
        <phone>
            <type>C</type>
            <number>987466321</number>
            <name></name>
        </phone>
        <phone>
            <type>P</type>
            <number>36547821</number>
            <name></name>
        </phone>
    </main>
    <main>
        <id>19587</id>
        <block_information>
            <name>Che</name>
            <start_date>20090210</start_date>
            <end_date>20100210</end_date>
        </block_information>
        <block_information>
            <name></name>
            <start_date>20100210</start_date>
            <end_date>20351005</end_date>
        </block_information>
        <city_information>
            <name></name>
            <start_date>20150210</start_date>
            <end_date>20150212</end_date>
        </city_information>
        <city_information>
            <name>No Name</name>
            <start_date>20150212</start_date>
            <end_date>20191005</end_date>
        </city_information>
        <phone>
            <type>C</type>
            <number>987466321</number>
            <name>Mom</name>
        </phone>
        <phone>
            <type>P</type>
            <number>36547821</number>
            <name></name>
        </phone>
    </main>
</source>

The output is like this:

<result>
        <main>
                <id>98567</id>
                <block_name>Block A.01</block_name>
                <city_name>New Manchester</city_name>
                <cellphone></cellphone>
                <phone>36547821</phone>
                <contact_phone></contact_phone>
                <contact_phone_name></contact_phone_name>
        </main>
</result>

For a <main> to appear in the result, it must have exactly one valid <block_information> and exactly one valid <city_information>. An entry is valid when its <start_date> is before the current date, its <end_date> is after the current date, and its <name> is not empty. If there is no valid entry, or more than one, the <main> is deleted.
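The validity rule above can be sketched in Java roughly as follows. This is only an illustration: the class `ValidityRule` and the method `isValid` are hypothetical names, and it assumes the dates are always eight digits in yyyyMMdd form, as in the sample.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class ValidityRule {

    // The sample XML stores dates as yyyyMMdd, which BASIC_ISO_DATE can parse.
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    /** True when the name is not blank and start < today < end (both strict). */
    public static boolean isValid(String name, String startDate, String endDate, LocalDate today) {
        if (name == null || name.trim().isEmpty()) {
            return false;
        }
        LocalDate start = LocalDate.parse(startDate, FMT);
        LocalDate end = LocalDate.parse(endDate, FMT);
        return start.isBefore(today) && end.isAfter(today);
    }

    public static void main(String[] args) {
        // A fixed date keeps the example reproducible; real code would use LocalDate.now().
        LocalDate today = LocalDate.of(2019, 7, 18);
        System.out.println(isValid("Block A", "20120210", "20150210", today));    // expired entry
        System.out.println(isValid("Block A.01", "20150210", "20251005", today)); // current entry
        System.out.println(isValid("", "20100210", "20351005", today));           // missing name
    }
}
```

A <main> is then kept only when exactly one <block_information> and exactly one <city_information> pass this check.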

For phone numbers, <type> is 'C' for contact, 'P' for personal phone, or 'M' for mobile. If the <type> is 'C' but <name> is empty, the phone does not go to the result. 'P' goes to <phone> and 'M' goes to <cellphone>.

I'd like your thoughts on the best-performing way to do this, in a form that anyone can easily adjust later if needed.

Thanks in advance for the input!

As asked by @kjhughes, I added some values to the sample XML and described some of the filters I need to apply. Thanks!

P.S.: The XML structure used in the example is much simpler than the real one, which has many more complex types.

AJ Siegel
  • Once you specify your criteria for which nodes to filter and which to allow to pass, it's a simple identity transformation adaptation in XSLT. If you need more specific information, you'll have to narrow the scope of your broad question. – kjhughes Jul 18 '19 at 20:18

2 Answers


I would go with the following approach:

  • find a library that lets you stream the XML (from a file or InputStream) and produce a Stream<Main>
  • process the Stream<Main> and filter each Main node according to your validation logic
  • depending on whether you are I/O-bound or CPU-bound, try a .parallel() stream (read: test whether .parallel() actually helps)

This should suffice for any sane performance requirements in the context of XML parsing (I guess?). Search for Java XML streaming and go from there (or maybe this Stack Overflow question can give some pointers).
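As one possible starting point, here is a minimal sketch of the streaming idea using StAX (`javax.xml.stream`), which ships with the JDK. It does not build a `Stream<Main>`; it reads the document once and collects the ids of the `main` elements that pass the block/city rule from the question. The class `MainStreamFilter` and its methods are hypothetical names, and the sketch assumes that `id`, `name`, `start_date` and `end_date` appear only where the sample XML shows them.

```java
import java.io.Reader;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class MainStreamFilter {

    // Hypothetical helper: non-empty name and start < today < end (dates as yyyyMMdd).
    static boolean isValid(String name, String start, String end, LocalDate today) {
        if (name == null || name.isEmpty() || start == null || end == null) {
            return false;
        }
        DateTimeFormatter fmt = DateTimeFormatter.BASIC_ISO_DATE;
        return LocalDate.parse(start, fmt).isBefore(today)
                && LocalDate.parse(end, fmt).isAfter(today);
    }

    /** Reads the document once and returns the ids of the main elements to keep. */
    public static List<String> keptIds(Reader xml, LocalDate today) {
        List<String> kept = new ArrayList<>();
        String id = null, name = null, start = null, end = null;
        int validBlocks = 0, validCities = 0;
        StringBuilder text = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected here
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    text.setLength(0);
                    String tag = reader.getLocalName();
                    if (tag.equals("main")) {
                        // New record: reset the per-main state.
                        id = null; validBlocks = 0; validCities = 0;
                    } else if (tag.equals("block_information") || tag.equals("city_information")) {
                        name = null; start = null; end = null;
                    }
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String value = text.toString().trim();
                    text.setLength(0);
                    switch (reader.getLocalName()) {
                        case "id": id = value; break;
                        case "name": name = value; break;
                        case "start_date": start = value; break;
                        case "end_date": end = value; break;
                        case "block_information":
                            if (isValid(name, start, end, today)) validBlocks++;
                            break;
                        case "city_information":
                            if (isValid(name, start, end, today)) validCities++;
                            break;
                        case "main":
                            // Keep the record only with exactly one valid block and city.
                            if (validBlocks == 1 && validCities == 1) kept.add(id);
                            break;
                        default: break;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
        return kept;
    }
}
```

Only the state of the current `main` is held in memory, so the footprint should stay roughly constant regardless of file size. For a real file, something like `keptIds(new java.io.FileReader("source.xml"), LocalDate.now())` would be the entry point, where the file name is just a placeholder.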

roookeee
  • Thanks! I will search for this and take a look, and I'll come back when I learn something! I'm using an InputStream, then dom.parse, and after that going node by node to check and delete. The main goal is to find out whether this is the best way or there is a better one :) – AJ Siegel Jul 18 '19 at 20:46
  • Good luck! The key is to not `dom.parse` the whole file as one, that's where some libraries come in handy that let you stream your nodes :) – roookeee Jul 18 '19 at 20:47

XSLT is a transformation language that has existed since 1999 and now has three versions: 1.0, 2.0, and 3.0. The latest version was published as a W3C Recommendation in 2017 and is supported on the Java platform by Saxon 9.8 and later, available in the open-source HE edition on SourceForge and Maven. XSLT 1.0 is supported in the Oracle/Sun Java JRE through an incorporated version of Apache Xalan.

So instead of using DOM, you have the option of using XSLT. Here is an example using XSLT 3 (online at https://xsltfiddle.liberty-development.net/bFN1yab/0):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output indent="yes"/>

  <xsl:function name="mf:date" as="xs:date">
      <xsl:param name="input-date" as="xs:string"/>
      <xsl:sequence
         select="xs:date(replace($input-date, '([0-9]{4})([0-9]{2})([0-9]{2})', '$1-$2-$3'))"/>
  </xsl:function>

  <xsl:function name="mf:select-valid-info" as="element()*">
      <xsl:param name="infos" as="element()*"/>
      <xsl:sequence
         select="$infos[name/normalize-space()
                       and mf:date(start_date) lt current-date()
                       and mf:date(end_date) gt current-date()]"/>
  </xsl:function>

  <xsl:function name="mf:valid-main" as="xs:boolean">
      <xsl:param name="main" as="element(main)"/>
      <xsl:sequence
        select="let $valid-blocks := mf:select-valid-info($main/block_information),
                    $valid-cities := mf:select-valid-info($main/city_information)
                return count($valid-blocks) eq 1 and count($valid-cities) eq 1"/>
  </xsl:function>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="main[not(mf:valid-main(.))]"/>

  <xsl:template match="main[mf:valid-main(.)]">
      <xsl:copy>
          <xsl:apply-templates 
            select="id,
                    mf:select-valid-info(block_information)/name,
                    mf:select-valid-info(city_information)/name,
                    phone"/>
      </xsl:copy>
  </xsl:template>

  <xsl:template match="block_information/name | city_information/name">
      <xsl:element name="{substring-before(local-name(..), '_')}_name">
          <xsl:value-of select="."/>
      </xsl:element>
  </xsl:template>

  <xsl:template match="main/phone[type = 'C']">
      <contact_phone>
          <xsl:value-of select="number[current()/normalize-space(name)]"/>
      </contact_phone>
      <contact_phone_name>
          <xsl:value-of select="name"/>
      </contact_phone_name>
  </xsl:template>

  <xsl:template match="main/phone[type = 'P']">
      <phone>
          <xsl:value-of select="number"/>
      </phone>
  </xsl:template>

  <xsl:template match="main/phone[type = 'M']">
      <cellphone>
          <xsl:value-of select="number"/>
      </cellphone>
  </xsl:template>

</xsl:stylesheet>

I hope I have grasped the conditions for the main elements. I have not quite understood the rules for the various phone data, but the code is meant as an example anyway.

Of course, performance depends very much on the implementation, but I think XSLT is a more structured and maintainable approach than DOM coding.
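For completeness, here is a minimal sketch of how a stylesheet can be run from Java through the standard JAXP API (`javax.xml.transform`). Note the assumptions: it relies on the XSLT 1.0 processor built into the JRE, so the embedded stylesheet is a much simplified XSLT 1.0 identity transformation that only drops `main` elements without a named `block_information`; it is not the XSLT 3 code above, which needs Saxon. The class name `XsltRunner` is made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltRunner {

    // Simplified XSLT 1.0 stylesheet (an assumption for this sketch):
    // copy everything, but drop any main that has no block_information with a non-empty name.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='main[not(block_information[normalize-space(name)])]'/>"
      + "</xsl:stylesheet>";

    /** Applies STYLESHEET to the given XML text and returns the serialized result. */
    public static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<source>"
            + "<main><id>1</id><block_information><name>A</name></block_information></main>"
            + "<main><id>2</id><block_information><name/></block_information></main>"
            + "</source>";
        System.out.println(transform(xml)); // only the main with id 1 should remain
    }
}
```

For a 140 MB input you would pass file-based `StreamSource` and `StreamResult` objects instead of strings; the filtering logic stays entirely in the stylesheet.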

If you can afford it, you can also look into Saxon 9.8 or 9.9 EE, which supports streaming XSLT 3. With some rewrites of the code above, you get an XSLT-based approach that streams forwards only through the huge document and materializes each main element as an element node to transform. This keeps the memory footprint low because, unlike DOM or normal XSLT processing, it doesn't first parse the whole XML document into a complete in-memory tree:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="#all"
    version="3.0">

    <xsl:mode streamable="yes" on-no-match="shallow-copy"/>

    <xsl:template match="source">
        <xsl:copy>
            <xsl:apply-templates select="main!copy-of()" mode="main"/>
        </xsl:copy>
    </xsl:template>

    <xsl:output indent="yes"/>

    <xsl:function name="mf:date" as="xs:date">
        <xsl:param name="input-date" as="xs:string"/>
        <xsl:sequence
            select="xs:date(replace($input-date, '([0-9]{4})([0-9]{2})([0-9]{2})', '$1-$2-$3'))"/>
    </xsl:function>

    <xsl:function name="mf:select-valid-info" as="element()*">
        <xsl:param name="infos" as="element()*"/>
        <xsl:sequence
            select="$infos[name/normalize-space()
            and mf:date(start_date) lt current-date()
            and mf:date(end_date) gt current-date()]"/>
    </xsl:function>

    <xsl:function name="mf:valid-main" as="xs:boolean">
        <xsl:param name="main" as="element(main)"/>
        <xsl:sequence
            select="let $valid-blocks := mf:select-valid-info($main/block_information),
            $valid-cities := mf:select-valid-info($main/city_information)
            return count($valid-blocks) eq 1 and count($valid-cities) eq 1"/>
    </xsl:function>

    <xsl:mode name="main" on-no-match="shallow-copy"/>

    <xsl:template match="main[not(mf:valid-main(.))]" mode="main"/>

    <xsl:template match="main[mf:valid-main(.)]" mode="main">
        <xsl:copy>
            <xsl:apply-templates 
                select="id,
                mf:select-valid-info(block_information)/name,
                mf:select-valid-info(city_information)/name,
                phone" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="block_information/name | city_information/name" mode="main">
        <xsl:element name="{substring-before(local-name(..), '_')}_name">
            <xsl:value-of select="."/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="main/phone[type = 'C']" mode="main">
        <contact_phone>
            <xsl:value-of select="number[current()/normalize-space(name)]"/>
        </contact_phone>
        <contact_phone_name>
            <xsl:value-of select="name"/>
        </contact_phone_name>
    </xsl:template>

    <xsl:template match="main/phone[type = 'P']" mode="main">
        <phone>
            <xsl:value-of select="number"/>
        </phone>
    </xsl:template>

    <xsl:template match="main/phone[type = 'M']" mode="main">
        <cellphone>
            <xsl:value-of select="number"/>
        </cellphone>
    </xsl:template>

</xsl:stylesheet>
Martin Honnen