I am reading a ZIP file using Java as below:
Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    // do stuff...
}
I am getting an out of memory error; the ZIP file is about 160 MB. The stack trace is below:
Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$1.<init>(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:197)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.zipFilePass2(DatToInsertDBBatch.java:250)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.processCompany(DatToInsertDBBatch.java:206)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.run(DatToInsertDBBatch.java:114)
at java.util.TimerThread.mainLoop(Timer.java:534)
at java.util.TimerThread.run(Timer.java:484)
How do I enumerate the contents of a big ZIP file without having to increase my heap size? Also, when I don't enumerate the contents and just access a single file, like this:
ZipFile zip = new ZipFile(zipFile);
ZipEntry ze = zip.getEntry("docxml.xml");
then I don't get an out of memory error. Why does this happen? How does a ZipFile handle its zip entries? The other option would be to use a ZipInputStream. Would that have a smaller memory footprint? I would eventually need to run this code on a micro EC2 instance on the Amazon cloud (613 MB RAM).
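For reference, this is roughly what I imagine the ZipInputStream version would look like (an untested sketch; the per-entry processing is elided):

import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Read the archive sequentially: only one entry is open at a time,
// so only one Inflater buffer should be live at any moment.
ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile));
try {
    ZipEntry ze;
    while ((ze = zis.getNextEntry()) != null) {
        // read the current entry's data from zis here
        zis.closeEntry();
    }
} finally {
    zis.close();
}

Would that actually keep the memory footprint down?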
EDIT: Providing more information on how I process the zip entries after I get them:
Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
    s3Object.setDataInputStream(zip.getInputStream(ze));
    s3Object.setStorageClass(S3Object.STORAGE_CLASS_REDUCED_REDUNDANCY);
    s3Object.addMetadata("x-amz-server-side-encryption", "AES256");
    s3Object.setContentType(Mimetypes.getInstance().getMimetype(s3Object.getKey()));
    s3Object.setContentDisposition("attachment; filename=" + FilenameUtils.getName(s3Object.getKey()));
    s3objs.add(s3Object);
}
I get the input stream from the ZipEntry and store it in the S3Object. I collect all the S3Objects in a list and then finally upload them to Amazon S3. For those who don't know Amazon S3: it's a file storage service; you upload files via HTTP.
I am thinking this might be happening because I collect all the individual input streams. Would it help if I batched them up, say 100 input streams at a time? Or would it be better to unzip the archive first and upload the unzipped files rather than holding streams?
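Concretely, something like this is what I had in mind for batching (a sketch reusing the variables from the loop above; uploadBatch is a hypothetical helper wrapping my existing upload code, and 100 is an arbitrary batch size):

List<S3Object> s3objs = new ArrayList<S3Object>();
Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
    s3Object.setDataInputStream(zip.getInputStream(ze));
    // ... same storage class / metadata / content-type calls as above ...
    s3objs.add(s3Object);
    if (s3objs.size() == 100) { // arbitrary batch size
        uploadBatch(s3objs);    // hypothetical: performs the S3 puts and closes each entry's stream
        s3objs.clear();         // drop references so the finished streams can be collected
    }
}
if (!s3objs.isEmpty()) {
    uploadBatch(s3objs);
}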