3

Is there a way to use StAX and JAXB to create an index and then get quick access to an XML file?

I have a large XML file and I need to find information in it. This is used in a desktop application, so it should work on systems with little RAM.

So my idea is this: Create an index and then quickly access data from the large file.

I can't just split the file because it's an official federal database that I want to use unaltered.

Using an XMLStreamReader I can quickly find some element and then use JAXB to unmarshal it.

    final XMLInputFactory xf = XMLInputFactory.newInstance();
    final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
    final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
    final Unmarshaller unmarshaller = ucontext.createUnmarshaller();
    r.nextTag();

    while (r.hasNext()) {
        final int eventType = r.next();
        if (eventType == XMLStreamConstants.START_ELEMENT
                && r.getLocalName().equals("foo")
                && Long.parseLong(r.getAttributeValue(null, "bla")) == bla) {
            // JAXB works just fine:
            final JAXBElement<Foo> foo = unmarshaller.unmarshal(r, Foo.class);
            System.out.println(foo.getValue().getName());
            // But how do I get the offset?
            // cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
            break;
        }
    }

But I can't get the offset. I'd like to use this to prepare an index:
(id of element) -> (offset in file)

Then I should be able to use the offset to unmarshal just from there: open a file stream, skip that many bytes, unmarshal. I can't find a library that does this. And I can't do it on my own without knowing the position of the file cursor. The javadoc clearly states that there is a cursor, but I can't find a way of accessing it.
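
For illustration, the lookup side I have in mind would be something like this (untested sketch; it assumes the index stores character offsets and that `Foo` is one of the wsimport-generated classes):

    Foo lookup(final long offset, final String filename) throws Exception {
        try (final Reader in = new FileReader(filename)) {
            long skipped = 0;
            while (skipped < offset) {
                // skip() may skip fewer chars than requested
                // (a real version should also guard against EOF)
                skipped += in.skip(offset - skipped);
            }
            // The reader is now positioned at "<foo ...>". A StAX reader
            // over it sees that element as the document root, and
            // unmarshalling stops at its matching end tag:
            final XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
            final Unmarshaller u = JAXBContext.newInstance(Foo.class).createUnmarshaller();
            return u.unmarshal(r, Foo.class).getValue();
        }
    }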


Edit:
I'm just trying to offer a solution that will work on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long, but it doesn't require much RAM. Just loading everything with JAXB requires 300 MB of RAM. Using some embedded DB system would just be a lot of overhead for such a simple task. I'll use JAXB anyway; anything else would be useless for me, since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.

I can't find a DB that only needs an XSD to create an in-memory DB and doesn't use that much RAM. Everything is made for servers, or requires defining a schema and mapping the XML. So I assume it just doesn't exist.

Claude Martin
  • Why not an in-memory database? XML is a horrible way to *store* information. – Kayaman Apr 12 '17 at 10:14
  • Have you tried `r.getLocation().getCharacterOffset()` ? – jschnasse Apr 12 '17 at 11:37
  • In-memory would use 3 times as much memory as the uncompressed xml file. That's a bit too much for a desktop application. So I'd like to use indexed access. – Claude Martin Apr 12 '17 at 12:12
  • Really? Which in-memory db did you try? H2? SQLite? XML is acceptable as a transport format, but as a storage format (with manipulation) it's pure garbage really. – Kayaman Apr 12 '17 at 12:15
  • @jschnasse: Yes, but the javadoc states: "All the information provided by a Location is optional. For example an application may only report line numbers." And I get the location of the cursor, but not of the current element: it has already read on to the next element. It's not the beginning of the element that caused `START_ELEMENT`. – Claude Martin Apr 12 '17 at 12:15
  • @Kayaman: No, I mean JAXB, which simply creates Java objects from the XSD. How would I get the data into H2? – Claude Martin Apr 12 '17 at 12:16
  • Why are you talking about Java XML Binding when I'm talking about in-memory databases? And don't ask how, go find out, you're the programmer, you're supposed to be able to do some research on your own. – Kayaman Apr 12 '17 at 12:18
  • Here's what I have: An XML file and an XSD file. All I want is to access (read only) the data without loading hundreds of MB of data. I searched for a library. I guess it doesn't exist. But I don't see why not. Is this such a rare use case of XML files? There's eXist-db, but they say it's only good for many small files, not for one large file. I'd have to write my own parser and indexer so I can then use JAXB. – Claude Martin Apr 12 '17 at 13:01
  • Like I said. XML is an acceptable *transport* format. Your described solution sounds like you're making an address book software for a school project, not any serious software designed to handle large amounts of data. Why do you have the data (still) as XML? You're trying to create a home grown index for your XML...why not go for the path of least effort and put the data in a SQLite file for example? – Kayaman Apr 12 '17 at 13:04
  • It's not relational data. It's a tree document. SQL makes little sense. JPA would be interesting if there's an implementation that allows me to access XML files. – Claude Martin Apr 12 '17 at 13:30
  • You sure love your XML. Well, good luck, you were warned. – Kayaman Apr 12 '17 at 13:33
  • How big is your XML doc? – vtd-xml-author Apr 14 '17 at 05:22
  • How does it matter how large it is? Each year it grows, because there's more data. Not every desktop has 16 GB of RAM. – Claude Martin Apr 21 '17 at 13:11
  • I played around with StAX `getLocation().getCharacterOffset()` but for more complex files nothing worked. I decided to provide an answer based on a generated XML parser using ANTLR. I'm excited to see if this works out for you! – jschnasse Apr 27 '17 at 20:51

2 Answers

7

You could work with a generated XML parser using ANTLR4.

The following works very well on a ~17 GB Wikipedia dump (`/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2`), but I had to increase the heap size using `-Xmx6g`.

1. Get XML Grammar

    cd /tmp
    git clone https://github.com/antlr/grammars-v4

2. Generate Parser

    cd /tmp/grammars-v4/xml/
    mvn clean install

3. Copy Generated Java files to your Project

    cp -r target/generated-sources/antlr4 /path/to/your/project/gen

4. Hook in with a Listener to collect character offsets

    package stack43366566;

    import java.util.ArrayList;
    import java.util.List;

    import org.antlr.v4.runtime.ANTLRFileStream;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTreeWalker;

    import stack43366566.gen.XMLLexer;
    import stack43366566.gen.XMLParser;
    import stack43366566.gen.XMLParser.DocumentContext;
    import stack43366566.gen.XMLParserBaseListener;

    public class FindXmlOffset {

        List<Integer> offsets = null;
        String searchForElement = null;

        public class MyXMLListener extends XMLParserBaseListener {
            @Override
            public void enterElement(XMLParser.ElementContext ctx) {
                String name = ctx.Name().get(0).getText();
                if (searchForElement.equals(name)) {
                    // record the character offset of the element's first token
                    offsets.add(ctx.start.getStartIndex());
                }
            }
        }

        public List<Integer> createOffsets(String file, String elementName) {
            searchForElement = elementName;
            offsets = new ArrayList<>();
            try {
                XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                XMLParser parser = new XMLParser(tokens);
                DocumentContext ctx = parser.document();
                ParseTreeWalker walker = new ParseTreeWalker();
                MyXMLListener listener = new MyXMLListener();
                walker.walk(listener, ctx);
                return offsets;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] arg) {
            System.out.println("Search for offsets.");
            List<Integer> offsets = new FindXmlOffset().createOffsets(
                    "/tmp/dewiki-20170501-pages-articles-multistream.xml", "page");
            System.out.println("Offsets: " + offsets);
        }
    }
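
If you need the (id -> offset) index from the question rather than a plain list, the listener can also inspect attributes. A sketch (not part of the example above; it assumes the element carries an `id` attribute):

    class IndexingListener extends XMLParserBaseListener {
        final String searchForElement;
        final java.util.Map<String, Integer> index = new java.util.HashMap<>();

        IndexingListener(String searchForElement) {
            this.searchForElement = searchForElement;
        }

        @Override
        public void enterElement(XMLParser.ElementContext ctx) {
            if (!searchForElement.equals(ctx.Name().get(0).getText())) {
                return;
            }
            for (XMLParser.AttributeContext a : ctx.attribute()) {
                if ("id".equals(a.Name().getText())) {
                    String id = a.STRING().getText();      // still quoted, e.g. "42"
                    id = id.substring(1, id.length() - 1); // strip the quotes
                    index.put(id, ctx.start.getStartIndex());
                }
            }
        }
    }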

5. Result

Prints:

    Offsets: [2441, 10854, 30257, 51419 ....

6. Read from Offset Position

To test the code, I've written a class that reads each Wikipedia page into a Java object

    @JacksonXmlRootElement
    class Page {
        public Page() {}
        public String title;
    }

using basically this code

    private Page readPage(Integer offset, String filename) {
        try (Reader in = new FileReader(filename)) {
            // skip to the character offset recorded by the indexer
            // (note: skip() may skip fewer chars than requested)
            in.skip(offset);
            ObjectMapper mapper = new XmlMapper();
            mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
            Page object = mapper.readValue(in, Page.class);
            return object;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
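
Putting the pieces together, a small driver could look like this (sketch; `OffsetReader` is just a hypothetical class holding `readPage`, made accessible for the call):

    public static void main(String[] args) {
        String file = "/tmp/dewiki-20170501-pages-articles-multistream.xml";
        List<Integer> offsets = new FindXmlOffset().createOffsets(file, "page");
        // jump straight to the third <page> element without re-reading
        // everything before it
        Page page = new OffsetReader().readPage(offsets.get(2), file);
        System.out.println(page.title);
    }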

Find the complete example on GitHub.

jschnasse
  • `ANTLRInputStream` is deprecated but this works with `CharStreams.fromPath`. I still need to check if I can use the offset with JAXB to load an element without reading the complete file. So far this looks very promising and it is the answer to my question on how to get the offset of elements. – Claude Martin May 12 '17 at 10:22
  • Looking forward. I just tried to read a really large file, ~17GB (Wikipedia dump). Increased heap to `-Xmx6g`, then stumbled upon this: http://stackoverflow.com/questions/24225568/negativearraysizeexception-antlrv4 . So I'm not sure if my answer will endure. – jschnasse May 12 '17 at 11:51
  • The problem seems to be in ANTLRInputStream. With ANTLRFileStream (also deprecated) it seems to work, even for the 17GB file. Still had to increase heap size. I will take a look at CharStreams. – jschnasse May 12 '17 at 12:05
  • It works! The data I need to parse isn't that large. I use a file of around 100MB for testing. You already used `start.getStartIndex()`. But I also need the length, which is: `ctx.stop.getStartIndex() - ctx.start.getStartIndex() + 1`. I made my own `InputStream`, which also has "limit", not just "skip". Then I need to create a `StreamSource` and use my custom stream (`setInputStream`) and then I just use an `XMLStreamReader` to unmarshal my object. I obviously know the type because I look for certain elements in the XML file. Unmarshalling takes only 0.3 seconds per element on my system. – Claude Martin May 12 '17 at 14:20
  • It's actually `ctx.stop.getStopIndex() - ctx.start.getStartIndex() + 1` – Claude Martin May 16 '17 at 12:46
  • I edited my answer. The provided code works on a 17GB Wikipedia dump. Under my GitHub account I provide a complete example that reads in the first 50 `<page>` elements from character offsets to Java objects using JAXB. – jschnasse May 16 '17 at 15:30
  • Using ANTLR to generate the parser is brilliant. The resulting parser not only provides the offsets I need, it also proves to be a perfect utility parser for general use. Well done! – Scott Sep 14 '18 at 08:31
  • Hi, thanks for this response. I am doing something similar but am unable to unmarshal a few events to my class. If you get a chance, can you please have a look at this question and provide your observations? https://stackoverflow.com/questions/67667516/jaxb-moxy-unmarshalling-for-large-file-fails-for-some-events – BATMAN_2008 May 24 '21 at 09:07
2

I just had to solve this problem, and spent way too much time figuring it out. Hopefully the next poor soul who comes looking for ideas can benefit from my suffering.

The first problem to contend with is that most XMLStreamReader implementations provide inaccurate results when you ask them for their current offsets. Woodstox, however, seems to be rock-solid in this regard.
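
For illustration, a minimal sketch (not from the original setup) that pins the StAX implementation to Woodstox and prints each start tag's character offset:

    // With woodstox-core on the classpath you can instantiate its factory
    // directly instead of relying on XMLInputFactory.newInstance():
    XMLInputFactory f = new com.ctc.wstx.stax.WstxInputFactory();
    try (Reader in = new FileReader(file)) {
        XMLStreamReader r = f.createXMLStreamReader(in);
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                // getCharacterOffset() returns an int, so offsets in very
                // large files can overflow it
                int charOffset = r.getLocation().getCharacterOffset();
                System.out.println(r.getLocalName() + " @ " + charOffset);
            }
        }
    }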

The second problem is the actual type of offset you use. You have to use char offsets if you need to work with a multi-byte charset, which means random-access retrieval from the file using those offsets is not going to be very efficient: you can't just set a pointer into the file at your offset and start reading; you have to read through until you get to the offset (that's what `skip` does under the covers in a `Reader`), and then start extracting. If you're dealing with very large files, that means retrieval of content near the end of the file is too slow.

I ended up writing a `FilterReader` that keeps a buffer of byte-offset-to-char-offset mappings as the file is read. When we need a byte offset, we first ask Woodstox for the char offset, then ask the custom reader for the actual byte offset corresponding to that char offset. We can get the byte offsets of the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a `RandomAccessFile`, which means it's super fast at any point in the file.
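
To make that concrete, here is a rough sketch of the idea (illustrative only, not the library's actual `ByteTrackingReader`; it assumes UTF-8 input and uses an unbounded map where the real thing keeps a bounded buffer):

    import java.io.FilterReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.TreeMap;

    class ByteTrackingSketch extends FilterReader {
        private long charOffset = 0;
        private long byteOffset = 0;
        // char offset -> byte offset of everything read so far
        private final TreeMap<Long, Long> map = new TreeMap<>();

        ByteTrackingSketch(Reader in) {
            super(in);
            map.put(0L, 0L);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c >= 0) {
                byteOffset += utf8Bytes((char) c);
                map.put(++charOffset, byteOffset);
            }
            return c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            for (int i = 0; i < n; i++) {
                byteOffset += utf8Bytes(buf[off + i]);
                map.put(++charOffset, byteOffset);
            }
            return n;
        }

        // Byte offset corresponding to a char offset we have already read past.
        long byteOffsetOf(long charOffset) {
            return map.get(charOffset);
        }

        // UTF-8 byte length contributed by one Java char; each half of a
        // surrogate pair counts as 2, so a pair sums to the 4 bytes UTF-8
        // uses for the code point.
        private static int utf8Bytes(char c) {
            if (c < 0x80) return 1;
            if (c < 0x800) return 2;
            if (Character.isSurrogate(c)) return 2;
            return 3;
        }
    }

Woodstox hands back the char offsets of the element's start and end, `byteOffsetOf` turns each into a byte position, and a `RandomAccessFile.seek` plus one read pulls out exactly those bytes.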

I created a library for this; it's on GitHub and Maven Central. If you just want the important bits, the party trick is in the `ByteTrackingReader`.

Some people have commented that this whole thing is a bad idea, asking why you would want to do it: XML is a transport mechanism; you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true. But if you're building applications or integrations that communicate via XML, you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents, and the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.

Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution.

Rob Ruchte