
I'm parsing large PCAP files in Java using Kaitai Struct. Whenever the file size exceeds Integer.MAX_VALUE bytes, I get an IllegalArgumentException caused by the size limit of the underlying ByteBuffer.

I haven't found references to this issue elsewhere, which leads me to believe that this is not a library limitation but a mistake in the way I'm using it.

Since the problem is caused by trying to map the whole file into the ByteBuffer, I'd think the solution would be to map only the first region of the file and, as the data is consumed, map again, skipping the data already parsed.

As this is done within the Kaitai Struct runtime library, it would mean writing my own class extending KaitaiStream and overriding the auto-generated fromFile(...) method, which doesn't really seem like the right approach.
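For illustration, the kind of sliding-window remapping I have in mind would be roughly this (a hypothetical, untested sketch; SlidingMappedRegion and WINDOW_SIZE are made-up names and this is not part of the Kaitai Struct runtime):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class SlidingMappedRegion {
    private static final long WINDOW_SIZE = 1L << 30; // 1 GiB window, well below Integer.MAX_VALUE

    private final FileChannel fc;
    private long windowStart;
    private ByteBuffer bb;

    SlidingMappedRegion(String fileName) throws IOException {
        fc = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ);
        remap(0);
    }

    // Maps only a window of the file instead of the whole file.
    private void remap(long start) throws IOException {
        windowStart = start;
        long len = Math.min(WINDOW_SIZE, fc.size() - start);
        bb = fc.map(FileChannel.MapMode.READ_ONLY, start, len);
    }

    // Re-maps further along the file once the current window is exhausted.
    private void ensureAvailable(int n) throws IOException {
        if (bb.remaining() < n) {
            remap(windowStart + bb.position());
        }
    }

    byte readU1() throws IOException {
        ensureAvailable(1);
        return bb.get();
    }
}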

The auto-generated method to parse from a file for the Pcap class is:

public static Pcap fromFile(String fileName) throws IOException {
  return new Pcap(new ByteBufferKaitaiStream(fileName));
}

And the ByteBufferKaitaiStream provided by the Kaitai Struct runtime library is backed by a ByteBuffer:

private final FileChannel fc;
private final ByteBuffer bb;

public ByteBufferKaitaiStream(String fileName) throws IOException {
    fc = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ);
    bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
}

This, in turn, is limited by the maximum ByteBuffer size.

Am I missing some obvious workaround? Is it really a limitation of the Java implementation of Kaitai Struct?

Julian

2 Answers


There are two separate issues here:

  1. Running Pcap.fromFile() for large files is generally not very efficient, as you'll eventually get the whole file parsed into memory at once. An example of how to avoid that is given in kaitai_struct/issues/255. The basic idea is that you'd want to have control over how you read every packet, and then dispose of each packet after you've parsed or accounted for it somehow (a combined sketch is shown after this list).

  2. The 2 GB limit on Java's memory-mapped files. To mitigate that, you can use the alternative RandomAccessFile-based KaitaiStream implementation, RandomAccessFileKaitaiStream: it might be slower, but it should avoid that 2 GB problem.
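Putting both points together, a rough, untested sketch could look like this. It assumes a Pcap class regenerated from a .ksy whose top-level seq no longer lists the packets (as suggested in the linked issue), so that constructing Pcap parses only the global header; handle(...) is just a placeholder for whatever per-packet processing you need:

import io.kaitai.struct.KaitaiStream;
import io.kaitai.struct.RandomAccessFileKaitaiStream;

public class PcapStreaming {
    public static void main(String[] args) throws Exception {
        try (KaitaiStream io = new RandomAccessFileKaitaiStream("capture.pcap")) {
            // Parses only the global header, given the edited .ksy described above
            Pcap root = new Pcap(io);
            while (!io.isEof()) {
                // Generated constructor: Packet(KaitaiStream _io, Pcap _parent, Pcap _root);
                // each packet is parsed, handled and then becomes garbage-collectable
                Pcap.Packet pkt = new Pcap.Packet(io, root, root);
                handle(pkt);
            }
        }
    }

    private static void handle(Pcap.Packet pkt) {
        // placeholder: count, filter, or export the packet here
    }
}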

GreyCat
  • 2. Solves the issues caused by the `ByteBuffer` size limit. – Julian May 21 '19 at 09:12
  • 1. Prevents loading the whole file at once and therefore prevents potential OutOfMemoryError issues. The provided link does this by parsing only the header, using a Pcap class generated from an edited .ksy (with the packets removed). Alternatively, it can also be done by creating the `KaitaiStream`, parsing the `Pcap.Header` first, and then parsing each `Pcap.Packet` with a custom constructor providing the `KaitaiStream` and the `Pcap.Header`. – Julian May 21 '19 at 09:21

This library provides a ByteBuffer implementation which uses long offsets. I haven't tried this approach, but it looks promising. See the section "Mapping Files Bigger than 2 GB":

http://www.kdgregory.com/index.php?page=java.byteBuffer

public int getInt(long index)
{
    return buffer(index).getInt();
}

// selects the mapped segment that contains the absolute index and positions it there
private ByteBuffer buffer(long index)
{
    ByteBuffer buf = _buffers[(int)(index / _segmentSize)];
    buf.position((int)(index % _segmentSize));
    return buf;
}

public MappedFileBuffer(File file, int segmentSize, boolean readWrite)
throws IOException
{
    if (segmentSize > MAX_SEGMENT_SIZE)
        throw new IllegalArgumentException(
                "segment size too large (max " + MAX_SEGMENT_SIZE + "): " + segmentSize);

    _segmentSize = segmentSize;
    _fileSize = file.length();

    RandomAccessFile mappedFile = null;
    try
    {
        String mode = readWrite ? "rw" : "r";
        MapMode mapMode = readWrite ? MapMode.READ_WRITE : MapMode.READ_ONLY;

        mappedFile = new RandomAccessFile(file, mode);
        FileChannel channel = mappedFile.getChannel();

        _buffers = new MappedByteBuffer[(int)(_fileSize / segmentSize) + 1];
        int bufIdx = 0;
        for (long offset = 0 ; offset < _fileSize ; offset += segmentSize)
        {
            long remainingFileSize = _fileSize - offset;
            // each region maps up to two segments' worth of bytes, so reads that
            // cross a segment boundary still fit inside a single ByteBuffer
            long thisSegmentSize = Math.min(2L * segmentSize, remainingFileSize);
            _buffers[bufIdx++] = channel.map(mapMode, offset, thisSegmentSize);
        }
    }
    finally
    {
        // close quietly
        if (mappedFile != null)
        {
            try
            {
                mappedFile.close();
            }
            catch (IOException ignored) { /* */ }
        }
    }
}
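For completeness, usage would be roughly along these lines (a hypothetical example that relies only on the constructor and getInt(long) shown above; the file name and offsets are made up):

import java.io.File;
import java.io.IOException;

public class LargePcapPeek {
    public static void main(String[] args) throws IOException {
        File file = new File("huge.pcap"); // assumed path to a capture larger than 2 GB
        // 256 MiB segments, read-only
        MappedFileBuffer buf = new MappedFileBuffer(file, 1 << 28, false);
        int magic = buf.getInt(0L);                // PCAP magic number at the start of the file
        int beyond2g = buf.getInt(3_000_000_000L); // an offset larger than Integer.MAX_VALUE
        System.out.printf("magic=0x%08x, value at 3 GB: %d%n", magic, beyond2g);
    }
}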
Dzmitry Bahdanovich
  • Thanks for your suggestion. Instead of mapping several `ByteBuffer`s using the provided library, I've implemented a class extending `KaitaiStream` that uses a single `ByteBuffer` to map the start of the file and then remaps as needed when the data is consumed. This works and doesn't have the problems with boundaries between buffers that the provided library solves with buffer overlaps. Anyway, as shown by [GreyCat](https://stackoverflow.com/a/56227307/4563824), `RandomAccessFileKaitaiStream` already provides an out-of-the-box alternative for large files. – Julian May 21 '19 at 09:06