5

I want to know which XML parser in Java (if any) can give me the byte offset of each XML element it parses.

I am using Lucene to index my XML files, and when I search for a particular word I need the output to include the XML element and the file name as well as the byte offset, so that I can seek quickly to that offset.

Pratik

2 Answers

4

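Have a look at VTD-XML (http://vtd-xml.sourceforge.net). VTDNav.getContentFragment() returns a long that encodes both the offset and the length of an element's content: javadoc.

The offset sits in the lower 32 bits, so you get it by casting the returned value to an int: (int) vn.getContentFragment(), where vn is your VTDNav instance.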
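
As a minimal sketch of that decoding step: the helper class `FragmentDecoder` below is hypothetical (it is not part of VTD-XML), and the packed value in `main` is built by hand so the example runs without the library. The commented-out lines show where a real value would presumably come from, assuming the usual `VTDGen`/`VTDNav` setup. The offset is in the lower 32 bits and the length in the upper 32 bits.

```java
// Hypothetical helper, not part of VTD-XML: unpacks the long returned by
// VTDNav.getContentFragment() (or getElementFragment()).
final class FragmentDecoder {

    /** Lower 32 bits: starting offset of the fragment. */
    static int offsetOf(long fragment) {
        return (int) fragment;
    }

    /** Upper 32 bits: length of the fragment. */
    static int lengthOf(long fragment) {
        return (int) (fragment >>> 32);
    }

    public static void main(String[] args) {
        // Stand-in for a value returned by vn.getContentFragment();
        // packed by hand here so the sketch runs without VTD-XML.
        long fragment = (42L << 32) | 1024L;
        System.out.println("offset=" + offsetOf(fragment)
                + " length=" + lengthOf(fragment));

        // With VTD-XML on the classpath, the value would come from e.g.:
        //   VTDGen vg = new VTDGen();
        //   if (vg.parseFile("test.xml", false)) {
        //       VTDNav vn = vg.getNav();          // cursor starts at the root
        //       long frag = vn.getContentFragment();
        //       // check frag != -1 before decoding (no content available)
        //   }
    }
}
```

With both numbers in hand you can store them in the Lucene document and later seek straight to the offset in the file (for example with `RandomAccessFile.seek`) and read that many bytes.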

morja
  • Hi Pratik. I am working on a project where I think this might help me. Did you get this to work ? – Puneet Jun 01 '13 at 17:48
0

Consider StAX (javax.xml.stream). Here is an example to start with:

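    import java.io.FileReader;
    import javax.xml.stream.Location;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;

    public class StaxLocationExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory f = XMLInputFactory.newInstance();
            XMLStreamReader xr = f.createXMLStreamReader(new FileReader("test.xml"));
            while (xr.hasNext()) {
                int n = xr.next();
                // Current location as reported by the parser
                Location l = xr.getLocation();
                switch (n) {
                case XMLStreamReader.START_ELEMENT:
                    System.out.println(l.getColumnNumber());
                    System.out.println(l.getLineNumber());
                    // ... more
                    break;
                }
            }
            xr.close();
        }
    }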
Evgeniy Dorofeev
  • Thanks, Evgeniy. I am not sure how line and column numbers would translate into a byte/character offset, as each line can have a variable number of bytes. – Pratik Nov 24 '12 at 18:36
  • 1
    The issue is that the SAX, DOM and StAX parsers are all limited to giving `char` offsets. If the backing stream uses a variable-length encoding (`UTF-8`), then unless they control the byte-stream-to-char-stream conversion, they cannot give byte offsets. The VTD API is the only one I know of that offers the byte offset, and even then, if you feed it a Reader rather than an InputStream, it will be unable to provide the byte offset. – Stephen Connolly Nov 24 '12 at 18:45
  • Is it still the case that all StAX parsers are using char offsets, or have any of them fixed this bug by now? The API docs explicitly state that getCharacterOffset() returns the byte offset when you have passed in a stream of bytes or a file. – Hakanai Jan 09 '14 at 01:21
  • Look here for a different solution with an ANTLR-based parser: http://stackoverflow.com/questions/43366566/using-stax-to-create-index-for-xml-for-quick-access – jschnasse May 15 '17 at 06:13
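
To illustrate the char-versus-byte point raised in the comments, here is a small sketch of how the two offsets drift apart once a UTF-8 document contains a non-ASCII character. The class `Utf8Offsets` and its method are hypothetical names for illustration. It assumes the whole document is available as a String and that the char offset does not fall inside a surrogate pair. Re-encoding the prefix is a simple way to show the difference, not an efficient way to index large files.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: converts a char offset into the UTF-8 byte offset
// by re-encoding everything before it.
final class Utf8Offsets {

    /** Number of UTF-8 bytes that precede the given char offset. */
    static int byteOffset(String text, int charOffset) {
        return text.substring(0, charOffset)
                   .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u00e9" (e with acute accent) is one char but two bytes in UTF-8.
        String xml = "<a>\u00e9</a><b/>";
        int charOffset = xml.indexOf("<b/>");
        // prints "char=8 byte=9"
        System.out.println("char=" + charOffset
                + " byte=" + byteOffset(xml, charOffset));
    }
}
```

This is why a parser that only sees a Reader cannot report byte offsets: by the time it receives chars, the information about how many bytes each one occupied is already gone.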