
I have been trying to create a Java program that will read zip files from an online API, unzip them into memory (not into the file system), and load them into a database. Since the unzipped files need to be loaded into the database in a specific order, I will have to unzip all of the files before I load any of them.

I basically used another question on StackOverflow as a model for how to do this. Using ZipInputStream from java.util.zip, I was able to do this with a smaller zip (0.7MB zipped, ~4MB unzipped), but when I encountered a larger file (25MB zipped, 135MB unzipped), the two largest files were not read into memory. I was not even able to retrieve a ZipEntry for these larger files (8MB and 120MB, the latter making up the vast majority of the data in the zip file). No exceptions were thrown, and my program proceeded until it tried to access the unzipped files that had failed to be written, at which point it threw a NullPointerException.

I am using Jsoup to download the zip file.

Has anyone had any experience with this and can give guidance on why I am unable to retrieve the complete contents of the zip file?

Below is the code that I am using. I am collecting the unzipped files as InputStreams in a HashMap, and the loop stops once getNextEntry() returns null, meaning there are no more ZipEntrys.

    private Map<String, InputStream> unzip(ZipInputStream verZip) throws IOException {

        Map<String, InputStream> result = new HashMap<>();

        while (true) {
            ZipEntry entry;
            byte[] b = new byte[1024];
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int l;

            entry = verZip.getNextEntry();//Might throw IOException

            if (entry == null) {
                break;
            }

            try {
                while ((l = verZip.read(b)) > 0) {
                    out.write(b, 0, l);
                }
                out.flush();
            } catch (EOFException e) {
                e.printStackTrace();
            } catch (IOException i) {
                System.out.println("there was an ioexception");
                i.printStackTrace();
                fail();
            }
            result.put(entry.getName(), new ByteArrayInputStream(out.toByteArray()));
        }
        return result;
    }

Might I be better off if my program took advantage of the filesystem to unzip files?
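For reference, here is a minimal sketch of the filesystem route, using only the JDK (the class name and demo entry name are made up for illustration): copy the download to a temporary file and open it with `java.util.zip.ZipFile`, which can read entries by name, in any order, without holding every entry in memory at once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipToTempFile {

    /** Copies the downloaded zip to a temp file and opens it as a ZipFile,
     *  which supports random access to entries by name. */
    static ZipFile toZipFile(InputStream download) throws IOException {
        Path tmp = Files.createTempFile("download", ".zip");
        tmp.toFile().deleteOnExit();
        Files.copy(download, tmp, StandardCopyOption.REPLACE_EXISTING);
        return new ZipFile(tmp.toFile());
    }

    public static void main(String[] args) throws IOException {
        // Build a small zip in memory to stand in for the downloaded stream.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }

        try (ZipFile zip = toZipFile(new ByteArrayInputStream(buf.toByteArray()))) {
            ZipEntry entry = zip.getEntry("a.txt"); // fetch by name, in any order
            byte[] data = zip.getInputStream(entry).readAllBytes();
            System.out.println(new String(data, StandardCharsets.UTF_8));
        }
    }
}
```

With this approach the required load order can be applied at read time by calling `getEntry` with each name from the ordered list, rather than buffering every entry up front.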

fairground
    You should not try to hold all of that data in memory. Just go through the `ZipEntry`s when it’s time to write to the database, instead of trying to put them all in a Map. Also, you need `while ((l = verZip.read(b)) >= 0)`—note the `>=` instead of `>`. Otherwise, your code will stop reading data the first time it encounters a zero byte. – VGR Dec 13 '19 at 00:24
  • @VGR I edited the post in response to this question. The unzipped files need to be loaded into the database in a specific order, so I have to unzip all of the files before I load any of them. – fairground Dec 13 '19 at 01:49
    @VGR Your comment about the read loop is incorrect. He is reading into a buffer, and the return value is a count, not a byte value. This code can never return zero at all, unless `b[]` is zero length. – user207421 Dec 13 '19 at 01:59
    How do you know no exceptions were thrown when you are ignoring them? – user207421 Dec 13 '19 at 02:02
  • @user207421 Oops, you’re right. Ignore what I said about `> 0`. – VGR Dec 13 '19 at 03:03
    Is that required insertion order different from the order of the entries in the zip file? It seems to me that you are losing any ordering by storing the names in a HashMap, which is unordered by design. – VGR Dec 13 '19 at 03:05
  • @user207421 Thanks for pointing that out. It looks like I made the silly mistake of throwing away an exception. After taking away the try/catch around `entry = verZip.getNextEntry();`, I found that I was causing the silent failure myself. The IOException I was catching says "Unexpected end of ZLIB input stream". – fairground Dec 13 '19 at 03:18
  • @VGR exactly. I then have a List which contains the key values of the HashMap in the order that values would need to be loaded into the database. – fairground Dec 13 '19 at 03:20

1 Answer


It turns out that Jsoup is the root of the issue. When obtaining binary data through a Jsoup connection, there is a limit on how many bytes will be read from the connection. By default, this limit is 1048576 bytes, or 1 megabyte. As a result, when I fed the binary data from Jsoup into a ZipInputStream, the data was cut off after one megabyte. This limit, `maxBodySizeBytes`, can be found in `org.jsoup.helper.HttpConnection.Request`.

        Connection c = Jsoup.connect("example.com/download").ignoreContentType(true);
        //^^returns a Connection that will only retrieve 1MB of data
        InputStream oneMb = c.execute().bodyStream();
        ZipInputStream oneMbZip = new ZipInputStream(oneMb);

Trying to unzip the truncated `oneMbZip` is what led to the `EOFException`.

With the code below, I was able to raise the Connection's byte limit to 1GB (1073741824) and was then able to retrieve the zip file without running into an EOFException.

        Connection c = Jsoup.connect("example.com/download").ignoreContentType(true);
        //^^returns a Connection that will only retrieve 1MB of data
        Connection.Request theRequest = c.request();
        theRequest.maxBodySize(1073741824);
        c.request(theRequest);//Now this connection will retrieve as much as 1GB of data
        InputStream oneGb = c.execute().bodyStream();
        ZipInputStream oneGbZip = new ZipInputStream(oneGb);

Note that `maxBodySizeBytes` is an `int`, so its upper limit is 2,147,483,647 bytes, or just under 2GB.
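As a side note (a sketch, not tested here): `Connection` also exposes `maxBodySize(int)` fluently, and the jsoup documentation treats a value of 0 as unlimited, which avoids choosing an arbitrary cap. The URL below is the placeholder from the snippets above.

```java
import org.jsoup.Jsoup;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

// maxBodySize(0) removes jsoup's body-size cap entirely
// (the jsoup docs treat 0 as "unlimited").
InputStream fullBody = Jsoup.connect("example.com/download")
        .ignoreContentType(true)
        .maxBodySize(0)
        .execute()
        .bodyStream();
ZipInputStream fullZip = new ZipInputStream(fullBody);
```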

fairground