How to parse a ZIP file from by-demand-generated-InputStream without having a large byte-array?

Question

Background

I've been trying to figure out how to deal with problematic ZIP files using a stream out of them.

The reasons for it:

Some ZIP files are not originated from a file-path. Some are from some Uri, some are even within another ZIP file.
Some ZIP files are quite problematic in opening, so together with the previous point, it is impossible to use just what the framework has to offer. Example of such files as "XAPK" files from APKPure website (example here).

As one of the possible solutions I've searched for, I asked about memory allocation via JNI, to hold the entire ZIP file inside the RAM, while using Apache's ZipFile class which can handle a zip file in various ways and not just via file-path.

The problem

Such a thing seems to work (here) very well, but it has some problems:

You don't always have the available memory.
You don't know for sure what's the max memory you are allowed to allocate without risking your app from crashing
In case you've accidentally chosen too much memory to allocate, you won't be able to catch it, and the app will crash. It's not like on Java, where you can safely use try-catch and save you from OOM (and if I'm wrong and you can, please let me know, because that's a very good thing to know about JNI) .

So, let's assume you can always create the InputStream (via Uri or from within an existing zip file), how could you parse it as a zip file?

What I've found

I've made a working sample that can do it, by using Apache's ZipFile, and letting it traverse the Zip file as if it has all in memory.

Each time it asks to read some bytes from some position, I re-create the inputStream.

It works fine, but the problem with this is that I can't optimize it well to minimize the amount of times I re-create the inputStream. I tried to at least cache the current InputStream, and that if it's good enough, re-use it (skip from current position if needed), and if the required position is before current one, re-create the inputStream. Sadly it failed in some cases (such as the XAPK file I've mentioned above), as it causes EOFException.

Currently I've worked only with Uri, but a similar solution can be done for InputStream you re-generate from within another zip-file.

Here's the inefficient solution (sample available here, including both inefficient solution and the one I tried to make better), which seems to always work well:

InefficientSeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class InefficientSeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var cachedSize: Long = -1L
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        getNewInputStream().use { inputStream ->
            inputStream.skip(position)
            //now we have an inputStream right on the needed position
            inputStream.readBytesIntoByteArray(buffer, wanted)
        }
        buf.put(buffer, 0, wanted)
        position += wanted
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}

InefficientSeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class InefficientSeekableInUriByteChannel(someContext: Context, private val uri: Uri) : InefficientSeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

Usage:

val file = ...
val uri = Uri.fromFile(file)
parseUsingInefficientSeekableInUriByteChannel(uri)
...
    private fun parseUsingInefficientSeekableInUriByteChannel(uri: Uri): Boolean {
        Log.d("AppLog", "testing using SeekableInUriByteChannel (re-creating inputStream when needed) ")
        try {
            val startTime = System.currentTimeMillis()
            ZipFile(InefficientSeekableInUriByteChannel(this, uri)).use { zipFile: ZipFile ->
                val entriesNamesAndSizes = ArrayList<Pair<String, Long>>()
                for (entry in zipFile.entries) {
                    val name = entry.name
                    val size = entry.size
                    entriesNamesAndSizes.add(Pair(name, size))
                    Log.v("Applog", "entry name: $name - ${numberFormat.format(size)}")
                }
                val endTime = System.currentTimeMillis()
                Log.d("AppLog", "got ${entriesNamesAndSizes.size} entries data. time taken: ${endTime - startTime}ms")
                return true
            }
        } catch (e: Throwable) {
            Log.e("AppLog", "error while trying to parse using SeekableInUriByteChannel:$e")
            e.printStackTrace()
        }
        return false
    }

And here's my attempt in improving it, which didn't work in some cases:

SeekableInputStreamByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
abstract class SeekableInputStreamByteChannel : SeekableByteChannel {
    private var position: Long = 0L
    private var actualPosition: Long = 0L
    private var cachedSize: Long = -1L
    private var inputStream: InputStream? = null
    private var buffer = ByteArray(DEFAULT_BUFFER_SIZE)
    abstract fun getNewInputStream(): InputStream

    override fun isOpen(): Boolean = true

    override fun position(): Long = position

    override fun position(newPosition: Long): SeekableByteChannel {
//        Log.d("AppLog", "position $newPosition")
        require(newPosition >= 0L) { "Position has to be positive" }
        position = newPosition
        return this
    }

    open fun calculateSize(): Long {
        return getNewInputStream().use { inputStream: InputStream ->
            if (inputStream is FileInputStream)
                return inputStream.channel.size()
            var bytesCount = 0L
            while (true) {
                val available = inputStream.available()
                if (available == 0)
                    break
                val skip = inputStream.skip(available.toLong())
                if (skip < 0)
                    break
                bytesCount += skip
            }
            bytesCount
        }
    }

    final override fun size(): Long {
        if (cachedSize < 0L)
            cachedSize = calculateSize()
//        Log.d("AppLog", "size $cachedSize")
        return cachedSize
    }

    override fun close() {
        inputStream.closeSilently().also { inputStream = null }
    }

    override fun read(buf: ByteBuffer): Int {
        var wanted: Int = buf.remaining()
//        Log.d("AppLog", "read wanted:$wanted")
        if (wanted <= 0)
            return wanted
        val possible = (calculateSize() - position).toInt()
        if (possible <= 0)
            return -1
        if (wanted > possible)
            wanted = possible
        if (buffer.size < wanted)
            buffer = ByteArray(wanted)
        var inputStream = this.inputStream
        //skipping to required position
        if (inputStream == null) {
            inputStream = getNewInputStream()
//            Log.d("AppLog", "getNewInputStream")
            inputStream.skip(position)
            this.inputStream = inputStream
        } else {
            if (actualPosition > position) {
                inputStream.close()
                actualPosition = 0L
                inputStream = getNewInputStream()
//                Log.d("AppLog", "getNewInputStream")
                this.inputStream = inputStream
            }
            inputStream.skip(position - actualPosition)
        }
        //now we have an inputStream right on the needed position
        inputStream.readBytesIntoByteArray(buffer, wanted)
        buf.put(buffer, 0, wanted)
        position += wanted
        actualPosition = position
        return wanted
    }

    //not needed, because we don't store anything in memory
    override fun truncate(size: Long): SeekableByteChannel = this

    override fun write(src: ByteBuffer?): Int {
        //not needed, we read only
        throw  NotImplementedError()
    }
}

SeekableInUriByteChannel.kt

@RequiresApi(Build.VERSION_CODES.N)
class SeekableInUriByteChannel(someContext: Context, private val uri: Uri) : SeekableInputStreamByteChannel() {
    private val applicationContext = someContext.applicationContext

    override fun calculateSize(): Long = StreamsUtil.getStreamLengthFromUri(applicationContext, uri)

    override fun getNewInputStream(): InputStream = BufferedInputStream(
            applicationContext.contentResolver.openInputStream(uri)!!)
}

The questions

Is there a way to improve it?

Maybe by having as little re-creation of InputStream as possible?

Are there more possible optimizations that will let it parse ZIP files well? Maybe some caching of the data?

I ask this because it seems it's quite slow compared to other solutions I've found, and I think this could help a bit.

I think that there are some good ideas in [this answer](https://stackoverflow.com/a/12293817/6287910) from Jon Skeet if you haven't already seen it. — Cheticamp, May 18 '20 at 16:29
@Cheticamp Doesn't seem fit to the question, because he talks about converting to byte array at some point. Plus I don't see any solution in code. All seem theoretical — android developer, May 19 '20 at 14:10
The idea I took away from that post was the ability to read down to the central directory at the end of the file to understand how the contents are organized. You might even be able to pull some essential data out while scanning. That would be one pass of the file. That information could be used to organize the actual read of the file to pull out entries that would take no more than one additional pass. Whether this would be a good way to go or not would depend on your needs. — Cheticamp, May 19 '20 at 14:45
The way I did it, is to let the zip-parsing be free of it, so that whenever it wishes to reach some place in the file, I go there, read the needed bytes, and give it back to it. That's how I avoid the need to learn how to parse zip files, while also trying to minimize the memory usage because I only read what I'm requested to read. The problem is that I think it's not quite an optimized one. I tried to re-use an already opened InputStream in case it's in a place that not after the requested one, but for some reason it sometimes causes issues with parsing. — android developer, May 19 '20 at 17:40
Another possible approach would be to cache byte arrays (like cache pages), so each time I reach some place, I could read a few KB ahead, so that if it would be requested soon, I could offer it right away. Anyway, I'm asking here how I can open any ZIP file without the fear of using too much memory, given that you can re-create the InputStream whenever you wish, while trying to have it working as best as it can (compared to File-based usage, which seems faster in most cases). — android developer, May 19 '20 at 17:43
Proceeding with the same technique, you could maintain a pool of readers (# TBD) in an LRU sort of cache. The reader closest to the desired point in the file, but not past that point, would be selected for a read operation. If all in-use readers are past the desired place in the file, a new reader would be started if available. If all readers are in use and past the desired place in the file, the least recently used reader would be selected and reopened at the start of the file. A little complicated but it may produce a performance improvement. — Cheticamp, May 20 '20 at 13:42
Right, but as I wrote, sadly even a pool of 1 sized InputStream didn't work for me for some reason. I have no idea why my solution doesn't work for 1 cached InputStream in some cases. If it would have worked, I could have come up with more similar ideas... — android developer, May 20 '20 at 17:32

score 1 · Accepted Answer · answered May 29 '20 at 20:05

1

The skip() method of BufferedInputStream doesn't always skip all the bytes that you specify. In SeekableInputStreamByteChannel change the following code

inputStream.skip(position - actualPosition)

to

var bytesToSkip = position - actualPosition
while (bytesToSkip > 0) {
    bytesToSkip -= inputStream.skip(bytesToSkip)
}

and that should work.

Regarding making things more efficient, the first thing that ZipFile does is to zoom to the end of the file to get the central directory (CD). With the CD in hand, ZipFile knows the make up of the zip file. The zip entries should be in the same order as the files are laid out. I would read the files you want in that same order to avoid backtracking. If you can't guarantee the read order then maybe multiple input streams would make sense.

answered May 29 '20 at 20:05

Cheticamp

61,413
10
78
131

Seems right! I handled the skipping fine in some places but here for some reason I missed it. And it indeed made it faster according to my plans. For example it took 39,481ms instead of 117,223ms to get all entries of all APK files of all installed apps. It seems that for this task it usually creates a new InputStream only twice. This might be a better solution than loading all into RAM, at least in some cases . The RAM solution is still a bit faster (27,940ms in the same test), though, and other solutions are faster too (when possible), while ZipFile is fastest: 2,372ms – android developer May 30 '20 at 11:22
I've pushed the update to fix it all. You can see the results and comparison too if you wish. Thank you very much! – android developer May 30 '20 at 11:24
Say, can you please offer the JNI solution as a library that everyone could use, via a dependency? For some reason I failed to do it... :( – android developer May 30 '20 at 12:43

How to parse a ZIP file from by-demand-generated-InputStream without having a large byte-array?

Background

The problem

What I've found

The questions

1 Answers1