5

I need to parse a large, complex XML file and write it to a flat file. Can you give me some advice?

File size: 500MB
Record count: 100K
XML structure:

<Msg>

    <MsgHeader>
        <!--Some of the fields in the MsgHeader need to be mapped to a Java object-->
    </MsgHeader>

    <GroupA> 
        <GroupAHeader/>
        <!--Some of the fields in the GroupAHeader need to be mapped to a Java object-->
        <GroupAMsg/>
        <!--50K records--> 
        <GroupAMsg/> 
        <GroupAMsg/> 
        <GroupAMsg/> 
    </GroupA>

    <GroupB> 
        <GroupBHeader/> 
        <GroupBMsg/>
        <!--50K records--> 
        <GroupBMsg/> 
        <GroupBMsg/> 
        <GroupBMsg/> 
    </GroupB>

</Msg>
Xstian
Weber
  • Is there a specific language you're going to use? – Ahmad Y. Saleh Dec 19 '12 at 12:12
  • Does the structure of the file have to be checked, or may you assume it to be valid per se? – Thilo Dec 19 '12 at 12:30
  • I'm using Java. JAXB/Spring Batch is the preferred option. I have read lots of posts but still have no idea how to process the above XML effectively. – Weber Dec 19 '12 at 13:01
  • In the future you should include that information in your question and especially in the tags. The world of software development is very, *very* large and the number of possible ways to address a question like this are inconceivably huge, so you have to narrow it down to what is actually useful to you. – RBarryYoung Dec 30 '12 at 13:51
  • possible duplicate of [Parsing very large XML documents (and a bit more) in java](http://stackoverflow.com/questions/355909/parsing-very-large-xml-documents-and-a-bit-more-in-java) – Kate Gregory Jan 02 '13 at 00:19

6 Answers

1

Within Spring Batch, I've written my own stax event item reader implementation that operates a bit more specifically than previously mentioned. Basically, I just stuff elements into a map and then pass them into the ItemProcessor. From there, you're free to transform it into a single object (see CompositeItemProcessor) from the "GatheredElement". Apologies for having a little copy/paste from the StaxEventItemReader, but I don't think it's avoidable.

From here, you're free to use whatever OXM marshaller you'd like; I happen to use JAXB as well.

public class ElementGatheringStaxEventItemReader<T> extends StaxEventItemReader<T> {
    private Map<String, String> gatheredElements;
    private Set<String> elementsToGather;
    ...
    @Override
    protected boolean moveCursorToNextFragment(XMLEventReader reader) throws NonTransientResourceException {
        try { 
            while (true) {
                while (reader.peek() != null && !reader.peek().isStartElement()) {
                    reader.nextEvent();
                }
                if (reader.peek() == null) {
                    return false;
                }
                QName startElementName = ((StartElement) reader.peek()).getName();
                if(elementsToGather.contains(startElementName.getLocalPart())) {
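                    reader.nextEvent(); // move past the actual start element
                    XMLEvent dataEvent = reader.nextEvent();
                    // An empty element yields an end element here, not characters, so guard the cast
                    if (dataEvent.isCharacters()) {
                        gatheredElements.put(startElementName.getLocalPart(), dataEvent.asCharacters().getData());
                    }
                    continue;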
                }
                if (startElementName.getLocalPart().equals(fragmentRootElementName)) {
                    if (fragmentRootElementNameSpace == null || startElementName.getNamespaceURI().equals(fragmentRootElementNameSpace)) {
                        return true;
                    }
                }
                reader.nextEvent();

            }
        } catch (XMLStreamException e) {
            throw new NonTransientResourceException("Error while reading from event reader", e);
        }
    }

    @SuppressWarnings("unchecked")
    @Override
    protected T doRead() throws Exception {
        T item = super.doRead();
        if(null == item)
            return null;
        T result = (T) new GatheredElementItem<T>(item, new HashedMap(gatheredElements));
        if(log.isDebugEnabled())
            log.debug("Read GatheredElementItem: " + result);
        return result; 
    }
}

The gathered element class is pretty basic:

public class GatheredElementItem<T> {
    private final T item;
    private final Map<String, String> gatheredElements;
    ...
}
0

I haven't dealt with such huge file sizes, but since you want to parse the XML and write it to a flat file, I'd suggest combining XML pull parsing with code that writes to the flat file incrementally (this might help), because we don't want to exhaust the Java heap. A quick Google search will turn up tutorials and sample code on XML pull parsing.
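
As a rough illustration of the pull-parsing idea, here is a minimal sketch using the JDK's built-in StAX cursor API (`XMLStreamReader`). It assumes that each record element (for example `GroupAMsg`) contains only text, which the question does not confirm; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams XML and writes one flat-file line per record element.
public class PullParseToFlatFile {

    /** Returns the number of records written. Only one event is held in memory at a time. */
    public static int convert(Reader in, Writer out, String recordElement)
            throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Turn off DTDs and external entities; they are not needed for this kind of input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader xml = factory.createXMLStreamReader(in);
        int count = 0;
        try {
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && recordElement.equals(xml.getLocalName())) {
                    // Assumption: the record is text-only. getElementText() throws
                    // XMLStreamException if it meets a child element.
                    out.write(xml.getElementText().trim());
                    out.write('\n');
                    count++;
                }
            }
        } finally {
            xml.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<Msg><GroupA><GroupAHeader/>"
                + "<GroupAMsg>a1</GroupAMsg><GroupAMsg>a2</GroupAMsg></GroupA></Msg>";
        StringWriter out = new StringWriter();
        int n = convert(new StringReader(sample), out, "GroupAMsg");
        System.out.print(out);
        System.out.println(n + " records");
    }
}
```

For a real 500MB file you would pass a `BufferedReader` over the input file and a `BufferedWriter` over the target file instead of the in-memory strings used in `main`.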

Waleed Almadanat
  • Yes, JAXB/Spring Batch is the preferred option, but I have no idea how to parse the above complex XML effectively. I'm a newbie in large XML parsing. Any comments will be appreciated. – Weber Dec 19 '12 at 13:06
0

In the end, I implemented a customized StaxEventItemReader.

  1. Configure fragmentRootElementName

  2. Configure my own manualHandleElement

    <property name="manualHandleElement">
    <list>
        <map>
            <entry>
                <key><value>startElementName</value></key>
                <value>GroupA</value>
            </entry>
            <entry>
                <key><value>endElementName</value></key>
                <value>GroupAHeader</value>
            </entry>
            <entry>
                <key><value>elementNameList</value></key>
                    <list>
                            <value>/GroupAHeader/Info1</value>
                            <value>/GroupAHeader/Info2</value>
                    </list>
            </entry>
        </map>
    </list>
    </property>

  3. Add the following fragment to MyStaxEventItemReader.doRead()

    while (true) {
        XMLEvent next = reader.peek();
        if (next == null) {
            break; // end of input: never call nextEvent() on an exhausted reader
        }
        if (next.isStartElement()) {
            // entering an element: extend the current path
            pathList.add("/" + next.asStartElement().getName().getLocalPart());
            reader.nextEvent();
            continue;
        }
        if (next.isEndElement()) {
            String endName = next.asEndElement().getName().getLocalPart();
            pathList.remove("/" + endName);
            reader.nextEvent();
            if (isManualHandleEndElement(endName)) {
                // the configured end element is closed: stop gathering
                pathList.clear();
                break;
            }
            continue;
        }
        if (next.isCharacters()) {
            String data = reader.nextEvent().asCharacters().getData();
            String currentPath = getCurrentPath(pathList);
            String startElementName = (String) currentManualHandleStartElement.get(MANUAL_HANDLE_START_ELEMENT_NAME);
            for (Object s : (List) currentManualHandleStartElement.get(MANUAL_HANDLE_ELEMENT_NAME_LIST)) {
                if (("/" + startElementName + s).equals(currentPath)) {
                    map.put(currentPath, data); // keep only the configured paths
                    break;
                }
            }
            continue;
        }
        reader.nextEvent(); // skip comments, processing instructions, etc.
    }
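
The path-tracking idea can also be tried outside Spring Batch. The following is a minimal, self-contained sketch using only the JDK's StAX event API. The class name, the method signature, and the use of absolute paths from the document root are assumptions for illustration and are not part of the reader described above.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: collects the text of selected element paths, then stops early.
public class HeaderPathCollector {

    public static Map<String, String> collect(Reader in, Set<String> wantedPaths, String stopElement)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        Deque<String> path = new ArrayDeque<>();          // open elements, root first
        Map<String, String> values = new LinkedHashMap<>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    path.addLast(event.asStartElement().getName().getLocalPart());
                } else if (event.isEndElement()) {
                    // Stop once the header element is closed; the records are not read here.
                    if (path.removeLast().equals(stopElement)) {
                        break;
                    }
                } else if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
                    String current = "/" + String.join("/", path);
                    if (wantedPaths.contains(current)) {
                        // merge() concatenates if the parser splits one text node into chunks
                        values.merge(current, event.asCharacters().getData(), String::concat);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<Msg><GroupA><GroupAHeader><Info1>x</Info1><Info2>y</Info2>"
                + "<Other>z</Other></GroupAHeader><GroupAMsg>m</GroupAMsg></GroupA></Msg>";
        Set<String> wanted = Set.of("/Msg/GroupA/GroupAHeader/Info1", "/Msg/GroupA/GroupAHeader/Info2");
        System.out.println(collect(new StringReader(sample), wanted, "GroupAHeader"));
    }
}
```

A `Deque` used as a stack avoids the ambiguity of removing a path segment by value when nested elements share a name.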

Weber
0

Give an ETL tool a try, for example Pentaho Data Integration (AKA Kettle).

jacktrade
0

If you would accept a solution other than JAXB/Spring Batch, you may want to have a look at the SAX parser.

This is a more event-oriented way of parsing XML files and may be a good approach when you want to write directly into the target file while parsing. The SAX parser does not read the whole XML content into memory but triggers methods when it encounters elements in the input stream. In my experience, this is a very memory-efficient way of processing.

In comparison to your StAX solution, SAX 'pushes' the data into your application. This means that you have to maintain the state yourself (for example, which tag you are currently in), so you have to keep track of your current location. I'm not sure whether that is something you really require.

The following example reads an XML file with your structure and prints out all text within GroupBMsg tags:

import java.io.FileReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class SaxExample implements ContentHandler
{
    // SAX may deliver one text node in several characters() calls, so accumulate the chunks
    private final StringBuilder currentValue = new StringBuilder();

    public static void main(final String[] args) throws Exception
    {
        final XMLReader xmlReader = XMLReaderFactory.createXMLReader();

        final FileReader reader = new FileReader("datasource.xml");
        final InputSource inputSource = new InputSource(reader);

        xmlReader.setContentHandler(new SaxExample());
        xmlReader.parse(inputSource);
    }

    @Override
    public void characters(final char[] ch, final int start, final int length) throws SAXException
    {
        currentValue.append(ch, start, length);
    }

    @Override
    public void startElement(final String uri, final String localName, final String qName, final Attributes atts) throws SAXException
    {
        // react on the beginning of tag "GroupBMsg" <GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            currentValue.setLength(0); // discard text collected before this record
        }
    }

    @Override
    public void endElement(final String uri, final String localName, final String qName) throws SAXException
    {
        // react on the ending of tag "GroupBMsg" </GroupBMsg>
        if (localName.equals("GroupBMsg"))
        {
            // TODO: write into file
            System.out.println(currentValue.toString());
        }
    }


    // the rest is boilerplate code for sax

    @Override
    public void endDocument() throws SAXException {}
    @Override
    public void endPrefixMapping(final String prefix) throws SAXException {}
    @Override
    public void ignorableWhitespace(final char[] ch, final int start, final int length)
        throws SAXException {}
    @Override
    public void processingInstruction(final String target, final String data)
        throws SAXException {}
    @Override
    public void setDocumentLocator(final Locator locator) {  }
    @Override
    public void skippedEntity(final String name) throws SAXException {}
    @Override
    public void startDocument() throws SAXException {}
    @Override
    public void startPrefixMapping(final String prefix, final String uri)
      throws SAXException {}
}
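
One possible way to fill in the "write into file" TODO is sketched below. It is a variation on the example above, not a drop-in replacement: it extends `DefaultHandler` to avoid the boilerplate methods, obtains the parser from `SAXParserFactory`, and writes each record to a `Writer`. The class name and the one-line-per-record output format are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: writes the text of each record element as one line.
public class SaxFlatFileWriter extends DefaultHandler {
    private final Writer out;
    private final String recordElement;
    private final StringBuilder text = new StringBuilder();
    private boolean inRecord;

    public SaxFlatFileWriter(Writer out, String recordElement) {
        this.out = out;
        this.recordElement = recordElement;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default SAXParserFactory is not namespace-aware, so compare against qName.
        if (qName.equals(recordElement)) {
            inRecord = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inRecord) {
            text.append(ch, start, length); // may be called several times per text node
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equals(recordElement)) {
            inRecord = false;
            try {
                out.write(text.toString().trim());
                out.write('\n');
            } catch (IOException e) {
                // SAX callbacks cannot throw IOException directly, so wrap it
                throw new SAXException("Could not write record", e);
            }
        }
    }

    public static void convert(InputSource in, Writer out, String recordElement) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new SaxFlatFileWriter(out, recordElement));
    }

    public static void main(String[] args) throws Exception {
        String sample = "<Msg><GroupB><GroupBHeader/>"
                + "<GroupBMsg>b1</GroupBMsg><GroupBMsg>b2</GroupBMsg></GroupB></Msg>";
        StringWriter out = new StringWriter();
        convert(new InputSource(new StringReader(sample)), out, "GroupBMsg");
        System.out.print(out);
    }
}
```

In practice the `Writer` would be a `BufferedWriter` over the target file, flushed and closed after `parse` returns.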
roemer
0

You can use the Declarative Stream Mapping (DSM) stream-parsing library. It can process both JSON and XML, and it doesn't load the XML file into memory. DSM only processes the data that you define in a YAML or JSON config.

You can call a method while reading the XML, which allows you to process the XML partially. You can deserialize this partially read XML data into Java objects.

You can even use it to read in multiple threads.

You can find good examples in these answers:

Unmarshalling XML to three lists of different objects using STAX Parser

JAVA - Best approach to parse huge (extra large) JSON file (same for XML)

mfe