
I have a function that unpacks a byte array Z that was packaged using the zlib library (adapted from here).

  • The packed data size is 4.11 GB, and the unpacked data will be 6.65 GB. I have 32 GB of memory, so this is well below the limit.
  • I tried increasing the Java heap size to 15.96 GB, but that didn't help.
  • The MATLAB_JAVA environment variable points to jre1.8.0_144.

I get the cryptic error

'MATLAB array exceeds an internal Java limit.' 

at the second line of this code:

import com.mathworks.mlwidgets.io.InterruptibleStreamCopier
a=java.io.ByteArrayInputStream(Z);
b=java.util.zip.GZIPInputStream(a);
isc = InterruptibleStreamCopier.getInterruptibleStreamCopier;
c = java.io.ByteArrayOutputStream;
isc.copyStream(b,c);
M=typecast(c.toByteArray,'uint8');
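
I suspect the "internal Java limit" is the maximum size of a single Java array: Java arrays are indexed with a signed 32-bit integer, so MATLAB cannot marshal more than roughly 2^31-1 elements (about 2 GB of bytes) into one Java object, and my 4.11 GB input is roughly twice that. A minimal sketch that should reproduce the failure (the sizes are illustrative):

ok  = java.io.ByteArrayInputStream(zeros(1, 1e6, 'uint8')); % fine: fits in one Java byte[]
bad = java.io.ByteArrayInputStream(zeros(1, 3e9, 'uint8')); % ~3 GB: exceeds the per-array cap, same error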

Attempting to implement Mark Adler's suggestion:

Z=reshape(Z,[],8);
import com.mathworks.mlwidgets.io.InterruptibleStreamCopier
a=java.io.ByteArrayInputStream(Z(:,1));
b=java.util.zip.GZIPInputStream(a);
for ct = 2:8, b.read(Z(:,ct)); end
isc = InterruptibleStreamCopier.getInterruptibleStreamCopier;
c = java.io.ByteArrayOutputStream;
isc.copyStream(b,c);

But at the isc.copyStream call I get this error:

Java exception occurred:
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(Unknown Source)
    at java.util.zip.InflaterInputStream.read(Unknown Source)
    at java.util.zip.GZIPInputStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at com.mathworks.mlwidgets.io.InterruptibleStreamCopier.copyStream(InterruptibleStreamCopier.java:72)
    at com.mathworks.mlwidgets.io.InterruptibleStreamCopier.copyStream(InterruptibleStreamCopier.java:51)

Reading directly from file: I also tried to read the data directly from the file.

streamCopier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier;
fileInStream = java.io.FileInputStream(java.io.File(filename));
fileInStream.skip(datastart); % skip to the start of the compressed payload
gzipInStream = java.util.zip.GZIPInputStream(fileInStream);
baos = java.io.ByteArrayOutputStream;
streamCopier.copyStream(gzipInStream,baos); % decompress everything into one Java array
data = baos.toByteArray;
baos.close;
gzipInStream.close;
fileInStream.close;

This works fine for small files, but with big files I get:

Java exception occurred:
java.lang.OutOfMemoryError

at the line streamCopier.copyStream(gzipInStream,baos). Presumably the ByteArrayOutputStream still accumulates the entire 6.65 GB result in a single Java array, which runs into the same limit.

  • Might not be really relevant but still - did you make sure that the filesystem to which you're trying to decompress supports files of this size? (e.g. FAT32 supports files of up to 4 GB). It might be worth adding the [tag:java] tag to the question, seeing how the function you're using uses java internally... Did you try `tall` arrays? – Dev-iL Oct 19 '17 at 13:09
  • Thank you for your comment. It's all NTFS. I added java. Maybe I should indeed somehow make the streamcopier use tall arrays. I will post here if I find a solution. – Gelliant Oct 19 '17 at 14:39

2 Answers


The bottleneck seems to be the size of each individual Java object being created. This happens in java.io.ByteArrayInputStream(Z), since a MATLAB array cannot be fed into Java without conversion, and also in copyStream, where data is actually copied into the output buffer/memory. I had a similar idea for splitting objects into chunks of allowable size (src):

function chunkDunzip(Z)
%% Imports:
import com.mathworks.mlwidgets.io.InterruptibleStreamCopier
%% Definitions:
MAX_CHUNK = 100*1024*1024; % 100 MB, just an example
%% Split to chunks:
nChunks = ceil(numel(Z)/MAX_CHUNK);
chunkBounds = round(linspace(0, numel(Z), nChunks + 1)); % nChunks+1 boundaries => nChunks chunks of <= MAX_CHUNK

% Wrap each chunk in its own, individually small, ByteArrayInputStream:
V = java.util.Vector();
for indC = 1:numel(chunkBounds)-1
  V.add(java.io.ByteArrayInputStream(Z(chunkBounds(indC)+1:chunkBounds(indC+1))));
end

% Chain the chunks into one logical stream, then decompress it:
S = java.io.SequenceInputStream(V.elements);
b = java.util.zip.InflaterInputStream(S);

isc = InterruptibleStreamCopier.getInterruptibleStreamCopier;
c = java.io.FileOutputStream(java.io.File('D:\outFile.bin')); % stream to disk instead of a giant Java array
isc.copyStream(b,c);
c.close();

end
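
A hypothetical usage pattern (the output path is hard-coded in the function above): run it, then read the decompressed result back into MATLAB:

chunkDunzip(Z); % decompress Z to D:\outFile.bin
fid = fopen('D:\outFile.bin', 'r');
M = fread(fid, Inf, 'uint8=>uint8'); % load the result as a uint8 column vector
fclose(fid);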

Several notes:

  1. I used a FileOutputStream since it doesn't run into the internal limit of Java objects (as far as my tests went).
  2. Increasing the Java heap memory is still required.
  3. I demonstrated it using deflate (InflaterInputStream), not gzip. The solution for gzip is very similar (see the sketch below); if this is a problem, I'll modify it.
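
For the gzip-wrapped data in the question, only the decompressor line should change; the chunked SequenceInputStream S is built exactly as above (untested sketch):

b = java.util.zip.GZIPInputStream(S); % instead of InflaterInputStream(S)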
Dev-iL
  • That looks like the answer to my problem. Using a `Vector` first and then `add` to fill it gradually with the `ByteArrayInputStream`s. I will test it first thing on Monday and let you know if it works. Thank you! – Gelliant Oct 21 '17 at 08:51
  • Thank you, this is the solution to my problem. I no longer have memory errors. It is now "Corrupt GZIP trailer", but that's a different problem, apparently related to GZIP itself. – Gelliant Oct 23 '17 at 07:40
  • You're welcome; happy I could help! BTW, just wondering: any idea how output streams can be similarly chained (so they don't run out of memory)? – Dev-iL Oct 23 '17 at 08:34

You don't need to read it all in one fell swoop. You can make repeated b.read() calls, providing a byte array of some prescribed length, and repeat until it returns -1. Use the resulting chunks to build up your result in a giant MATLAB array instead of in a giant Java array, as in the sketch below.
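
A minimal MATLAB sketch of this idea, assuming the Apache commons-io classes bundled with MATLAB are available on the Java classpath; the chunk size, the IOUtils.copyLarge helper, and the file-based input borrowed from the question are illustrative choices, not a definitive implementation:

CHUNK = 64*1024*1024; % bytes per read, arbitrary but well under the Java array cap
fis = java.io.FileInputStream(java.io.File(filename));
fis.skip(datastart); % skip to the compressed payload, as in the question
gz = java.util.zip.GZIPInputStream(fis);
M = zeros(0, 1, 'uint8'); % result accumulates in MATLAB memory
while true
    buf = java.io.ByteArrayOutputStream(); % small (<= CHUNK) Java-side buffer
    n = org.apache.commons.io.IOUtils.copyLarge(gz, buf, 0, CHUNK);
    if n <= 0, break; end % nothing copied: end of stream
    M = [M; typecast(buf.toByteArray(), 'uint8')]; % append chunk; preallocate M if the final size is known
end
gz.close();
fis.close();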

Mark Adler
  • I tried this with a smaller file, for which I split the byte stream into 8 pieces. However, I get the error `Unexpected end of ZLIB input stream`. I updated my question to include this attempt. – Gelliant Oct 16 '17 at 16:37