5

For good or bad I have been using code like the following without any problems:

ZipFile aZipFile = new ZipFile(fileName);   
InputStream zipInput = aZipFile.getInputStream(name);  
int theSize = zipInput.available();  
byte[] content = new byte[theSize];  
zipInput.read(content, 0, theSize);

I have used it (this logic of obtaining the available size and reading directly to a byte buffer) for File I/O without any issues and I used it with zip files as well.

But recently I stepped into a case that the zipInput.read(content, 0, theSize); actually reads 3 bytes less that the theSize available.

And since the code is not in a loop to check the length returned by zipInput.read(content, 0, theSize); I read the file with the 3 last bytes missing
and later the program can not function properly (the file is a binary file).

Strange enough with different zip files of larger size e.g. 1075 bytes (in my case the problematic zip entry is 867 bytes) the code works fine!

I understand that the logic of the code is probably not the "best" but why am I suddenly getting this problem now?

And how come if I run the program immediately with a larger zip entry it works?

Any input is highly welcome

Thanks

Cratylus
  • 52,998
  • 69
  • 209
  • 339

3 Answers3

7

From the InputStream read API docs:

An attempt is made to read as many as len bytes, but a smaller number may be read.

... and:

Returns: the total number of bytes read into the buffer, or -1 if there is no more data because the end of the stream has been reached.

In other words unless the read method returns -1 there is still more data available to read, but you cannot guarantee that read will read exactly the specified number of bytes. The specified number of bytes is the upper bound describing the maximum amount of data it will read.

Adamski
  • 54,009
  • 15
  • 113
  • 152
  • I know about this.That is why I mentioned that my approach is not that good.Nevertheless I am trying to understand this behavior, of missing the last 3 bytes in a small file but not have any issue in larger files – Cratylus Oct 24 '11 at 11:29
  • 3
    @user384706: understanding that specific behaviour serves no goal: it is implementation-dependent and your code can go wrong in any number of ways, depending on many factors. It's important to understand why your code can go wrong in general and how to fix it. – Joachim Sauer Oct 24 '11 at 11:53
2

Using available() does not guarantee that it counted total available bytes to the end of stream.
Refer to Java InputStream's available() method. It says that

Returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream. The next invocation might be the same thread or another thread. A single read or skip of this many bytes will not block, but may read or skip fewer bytes.

Note that while some implementations of InputStream will return the total number of bytes in the stream, many will not. It is never correct to use the return value of this method to allocate a buffer intended to hold all data in this stream.

An example solution for your problem can be as follows:

ZipFile aZipFile = new ZipFile(fileName);   
InputStream zipInput = aZipFile.getInputStream( caImport );  
int available = zipInput.available();  
byte[] contentBytes = new byte[ available ];  
while ( available != 0 )   
{   
    zipInput.read( contentBytes );   
    // here, do what ever you want  
    available = dis.available();  
} // while available  
...   

This works for sure on all sizes of input files.

Joachim Sauer
  • 302,674
  • 57
  • 556
  • 614
Ravinder Reddy
  • 23,692
  • 6
  • 52
  • 82
  • 5
    Please don't use `available` for this **at all**! It doesn't help and only makes your code more fragile. Simply read until you get no more content. – Joachim Sauer Oct 24 '11 at 11:52
  • 1
    @Joachim:So when is `available` used? – Cratylus Oct 24 '11 at 12:01
  • 1
    @user384706: The only valid use that *I* know of is if you want to check a slow stream (for example a socket) to see *if* there is any data available at all (and if nothing is available, do something else). The reason that it's not useful for anything else is it's inherently racy nature (i.e. by the time you do the actual `read()` the number might already have changed). – Joachim Sauer Oct 24 '11 at 12:03
  • @Joachim:Interesting.You know I have been using this for quite a while and I never had any issues.Maybe because the files were small?Or pure luck? – Cratylus Oct 24 '11 at 12:14
  • 2
    @user384706: many `InputStream` implementation have some "good scenarios" where they will appear to work. Small files, big files, sector-aligned files, MTU-sizes on the network. All of those might be cases where the code seems to be fine. And once one of these factors goes wrong your code suddenly produces wrong output. – Joachim Sauer Oct 24 '11 at 12:16
  • @Joachim:I see.The downside from my point of view is that it is very convenient to see how much bytes the internal byte buffer you will read into will be.But interesting information! Thank you – Cratylus Oct 24 '11 at 12:34
  • 1
    @user384706: that's a dangerous kind of "convencience": The `available` size might be an arbitrary number that has nothing to do with what you actually need. If your goal is to read the full file, then you'll need a bigger buffer. If your goal is to read as fast as possible then your buffer should be some sane, constant size to avoid having to re-allocate it and if your goal is to read "one package of information" (in some defined protocol) then you'll need to ensure that you read *exactly* that many bytes. – Joachim Sauer Oct 24 '11 at 12:35
  • @JoachimSauer: so the `available()` being an arbitrary, will it affect the behaviour when run over different VM's? how is `available()` defined different from `read()`? – Ravinder Reddy Oct 24 '11 at 12:49
  • 1
    @Ravinder: it *can* result in different behaviour. The only *relevant* difference in their definition is the word "estimate". – Joachim Sauer Oct 24 '11 at 13:07
  • Is this the best way to read it? `byte[] content = new byte[2048]; ByteArrayOutputStream bOut = new ByteArrayOutputStream(); while((n= zipInput.read(content)) != -1 ){ bOut.write(content, 0, n); }` – Cratylus Oct 24 '11 at 13:18
  • @JoachimSauer: Does it mean that `read( intoByteArrayOfSize[available] )` may return some thing less than the estimated `available()`? If yes, does it mean that the next call to `available()` will not actually start from where it skipped? and again, if yes, why should it fail looping until found zero available? because this is as similar as `read` that returns a `-1` as the end of read from stream. – Ravinder Reddy Oct 24 '11 at 13:42
  • `read()` *surely* can return less than `available` it *can also easily* read more (provided the buffer you pass in is bigger, of course). `available()` doesn't "start" anywhere. It estimates how many bytes are avaiable. It *could* under-estimate and it *could* over-estimate the number of available bytes. Therefore it should **never** be used where a *guaranteed correct* answer is required. – Joachim Sauer Oct 24 '11 at 13:46
  • @user384706: only difference I find is comparing with a `-1` after read over `available != 0`. As `read` gurantees that it returns `eos` as `-1`, I think, we can rely on `!= -1`. – Ravinder Reddy Oct 24 '11 at 14:02
  • @user384706 Pure luck. If you had been using an SSLSocket you would never have got anything from `available()` except zero. It isn't obliged to return any other value, and it is explicitly documented *not* to return the total byte remaining in the stream. As read() returns the number of bytes actually read there is no reason to call available() before it, unless you are hoping to avert blocking. – user207421 Oct 25 '11 at 01:44
  • @Ravinder The Javadoc you quoted yourself says, 'a single read or skip of this many bytes will not block'. That can only mean that it reads at least as many bytes as were returned by `available()`. – user207421 Oct 25 '11 at 01:47
  • @EJP:The way you describe it, it seems to me that `available` has no use at all.Not even the use mentioned by Joachim, to check if there are data in a slow stream – Cratylus Oct 25 '11 at 06:03
  • @user384706: you shooted my words! :) I appreciate they way JoachimSauer replied but it would have been better if given a source of documentation on the actual implementation of `available()`. We need to blame javadoc given by the implementors for copy pasting the same doc comments from the parent docs. If `available()` is just an estimate, then it should for sure be declared `deprecated`. Otherwise the documentation should be very clear on which use cases the `available()` is best to depend on. – Ravinder Reddy Oct 25 '11 at 17:52
  • @Ravinder:I personally I am splitted.From one side I find this new info on `available` very interesting.On the other side I can not help wonder how come I was so lucky and never got an issue with `available` despite I extensively used it (and I mean extensively).Additionally searching through the web, I could not find a specific example of what is the best way to handle the input stream if you want to avoid reading in byte per byte.Many sites mention `available`... – Cratylus Oct 25 '11 at 19:33
  • @user384706 You can only use it for any purpose if you know it can return something other than zero. You can't rely on that in general but there are specific circumstances where you can, such as System.in.available(), or if you know you have a plaintext socket. The way to handle any input stream is to read it and block, dedicating a thread to it, and setting a socket timeout if it's a socket. – user207421 Oct 26 '11 at 02:24
  • @Ravinder it is the Javadoc tool itself that does the copy/pasting, not the implementors. The fact that it's an estimate doesn't imply that it should be deprecated. The two are not related. – user207421 Oct 26 '11 at 02:26
  • @EJP: I dont think Javadoc tool is that intelligent. It is the instruction passed to the Javadoc tool to include doc comments from parent class-method. The instruction must be from implementors. I feel for the fact that `available` is just an estimate, the `InputStream` object should not be a shared object but private, to get correct results. – Ravinder Reddy Oct 28 '11 at 07:35
  • 1
    @Ravinder it most certainly is that intelligent. Read the Javadoc of the Javadoc tool. If you don't specify any doc-comments of your own and the source of the defining class is available its Javadoc is copied in. I don't see what difference making the `InputStream` private makes. – user207421 Oct 28 '11 at 07:58
  • @EJP: On Javadoc, my comment is that, it is the implementors responsibility to declare how reliable their implementation is. If they leave with no new comment, programmers would be keeping it away due to the fear of `estimate` only. And other fact is, I never faced a problem using `available()` on various `InputStream`'s. Those opened streams were never shared but used within the modules privately. I felt, if those objects were set shared, perhaps I might have realized the issue with `available`. – Ravinder Reddy Oct 28 '11 at 08:50
  • @Ravinder How? InoutStream isn't going to behave differently depending on whether it's public or private. It's going to behave differently depending on where the InputStream came from, which concrete class is extending it, etc. Completely different issues. – user207421 Oct 28 '11 at 09:11
  • @EJP: what is the effect of `clas1.public_isObj_of_ClsX.read(..)` and `clas2.the_same_public_IsObj_of_ClsX.read(..)` in a sequence until end of stream? – Ravinder Reddy Oct 28 '11 at 09:24
  • @Ravinder They both read the stream of course. What does that have to do with using/misusing `available()`? – user207421 Oct 29 '11 at 00:53
  • It doesn't 'work for sure'. If available() returns zero it does nothing. If available() returns the file size, it allocates a potentially gigantic buffer. It is also pointless. There's nothing wrong with using a fixed size buffer, and it certainly creates much less garbage. – user207421 Oct 30 '11 at 07:05
  • @EJP: yes. As discussed on various cases, there can be cases where available returns 0 though there exists some amount of data that be read. hence my statement 'work for sure' is wrong. I need to update my answer. But still wondering if someone want to use `available()` if it is just an estimate! Otherwise it shall be declared deprecated or obsolete. – Ravinder Reddy Nov 03 '11 at 11:42
0

The best way to do this should be as bellows:

public static byte[] readZipFileToByteArray(ZipFile zipFile, ZipEntry entry)
    throws IOException {
    InputStream in = null;
    try {
        in = zipFile.getInputStream(entry);
        return IOUtils.toByteArray(in);
    } finally {
        IOUtils.closeQuietly(in);
    }
}

where the IOUtils.toByteArray(in) method keeps reading until EOF and then return the byte array.

Adrian Liu
  • 385
  • 3
  • 13