
This is a continuation of my question about downloading files in chunks. The explanation will be quite long, so I'll try to divide it into several parts.

1) What I tried to do

I was creating a download manager for a Windows Phone application. First, I tried to solve the problem of downloading large files (the explanation is in the previous question). Now I want to add a "resumable download" feature.

2) What I've already done.

At the moment I have a working download manager that lets me get around the Windows Phone RAM limit. The idea behind this manager is that it downloads small chunks of the file sequentially, using the HTTP Range header.

A fast explanation of how it works:

The file is downloaded in chunks of constant size; let's call this size "delta". After a file chunk has been downloaded, it is saved to local storage (hard disk; on WP it's called Isolated Storage) in Append mode, so the downloaded byte array is always added to the end of the file. After downloading a single chunk, the statement

if (mediaFileLength >= delta) // mediaFileLength is a length of downloaded chunk

is checked. If it's true, it means there's something left to download, and the method is invoked recursively. Otherwise this chunk was the last one, and there's nothing left to download.
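A minimal sketch of that loop, to make the logic concrete (method and helper names like `AppendToIsolatedStorage` are illustrative placeholders, not my actual code, and the exact API for setting the Range header differs between .NET profiles):

```csharp
// Sketch of the recursive chunked download described above.
private void DownloadChunk(string url, long offset, int delta)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // Ask the server for the byte range [offset, offset + delta - 1].
    request.Headers[HttpRequestHeader.Range] =
        string.Format("bytes={0}-{1}", offset, offset + delta - 1);

    request.BeginGetResponse(ar =>
    {
        using (WebResponse response = request.EndGetResponse(ar))
        using (Stream stream = response.GetResponseStream())
        using (MemoryStream buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);              // read the whole chunk
            byte[] chunk = buffer.ToArray();
            AppendToIsolatedStorage(chunk);     // saved with FileMode.Append
            if (chunk.Length >= delta)          // full chunk => more to come
                DownloadChunk(url, offset + delta, delta);
        }
    }, null);
}
```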

3) What's the problem?

As long as I used this logic for one-time downloads (by "one-time" I mean you start downloading a file and wait until the download finishes), it worked well. However, I decided that I need a "resume download" feature. So, the facts:

3.1) I know that the file chunk size is a constant.

3.2) I know whether the file has been completely downloaded or not. (That's an indirect result of my app logic; I won't weary you with the explanation, just take it as a fact.)

From these two statements it follows that the number of downloaded chunks is equal to (CurrentFileLength) / delta, where CurrentFileLength is the size in bytes of the already-downloaded file.

To resume downloading the file, I should simply set the required headers and invoke the download method. That seems logical, doesn't it? And I tried to implement it:

    // Check file size
    using (IsolatedStorageFileStream fileStream = isolatedStorageFile.OpenFile("SomewhereInTheIsolatedStorage", FileMode.Open, FileAccess.Read))
    {
        int currentFileSize = Convert.ToInt32(fileStream.Length);
        int currentFileChunkIterator = currentFileSize / delta; // integer division truncates
    }

And what do I see as a result? The downloaded file length is 2432000 bytes (delta is 304160; the total file size is about 4.5 MB, so we've downloaded only about half of it). So the quotient is approximately 7.995, but since this is integer (long/int) division, the result is 7, while it should be 8! Why is this happening? Simple math says the file length should be 2433280 (8 × 304160), so the reported value is very close, but not equal.

Further investigation showed that all values returned by fileStream.Length are close to the expected values, but not exactly accurate.

Why is this happening? I don't know exactly. Perhaps the .Length value is taken from file metadata somewhere, or such rounding is normal for this method. Or perhaps, when the download was interrupted, the file wasn't saved completely... (no, that's too far-fetched, it can't be).

So the problem is set: how to determine the number of chunks already downloaded. The question is how to solve it.

4) My thoughts about solving the problem.

My first thought was to use some math here: define some epsilon-neighborhood and use it in the currentFileChunkIterator = currentFileSize / delta; statement. But that forces us to think about type I and type II errors (or false alarm and miss, if you don't like statistics terms), because perhaps there's actually nothing left to download. Also, I haven't checked whether the difference between the reported value and the true value grows steadily or fluctuates cyclically. With small sizes (about 4-5 MB) I've seen only growth, but that doesn't prove anything.

So, I'm asking for help here, as I don't like my solution.

5) What I would like to hear as an answer:

What causes the difference between the real value and the received value?

Is there a way to obtain the true value?

If not, is my solution good enough for this problem?

Are there other, better solutions?

P.S. I won't set a Windows Phone tag, because I'm not sure this problem is OS-related. I used the Isolated Storage Tool to check the size of the downloaded file, and it showed me the same value as the one received in code (I'm sorry about the Russian language in the screenshot):

[Screenshot: the Isolated Storage Tool showing the unexpected file size]

Olter
  • Are you downloading the file in binary or text mode? Depending on OS new line handling you'll get different sizes - `\r\n` and `\n`, etc... Another thought: how are you determining the total length of the file being downloaded - from the HTTP headers? OS cluster size and the method used to determine file size might be playing a role in what you're seeing. That the difference between downloaded and received is exactly 1024 bytes is ringing some bells here. – Anthill Feb 13 '13 at 08:02
  • Maybe there's something wrong in your append logic? As a test, you could save each chunk to a separate file and see if the file lengths match up in that case. To me it also sounds like a better system: download chunks, and combine them into one file when everything is done (but that's just my first-idea, didn't-give-it-that-much-thought feeling). – Willem van Rumpt Feb 13 '13 at 08:08
  • @Anthill, everything works in binary mode. The difference between files is always different, but small, I also tried to find a regularity. Didn't find :). – Olter Feb 13 '13 at 08:13
  • @WillemvanRumpt, the downloaded file is a media file, so theoretically, if something was wrong with the "append", I might have noticed it during listening(the difference is very small, yeah). And as I see, everything downloaded works ok, and also, if the file was fully downloaded, I can't see any trouble there. – Olter Feb 13 '13 at 08:16
  • When you read from a stream into a buffer, that buffer isn't always filled completely (the Read method returns the actual number). So if you always write the complete buffer, you end up with files that are too big (and corrupted). – Hans Kesting Feb 13 '13 at 08:17
  • 2432000 / 304160 = 7.9957916885..., meaning you received 7 full chunks × 304160 = 2129120, plus a last chunk of 0.9957916885... × 304160 = 302880. Sum: 2129120 + 302880 = 2432000. Can you explain the simple math that presumably suggests the file should be 2433280 (and not 2432000)? – G.Y Feb 13 '13 at 08:35
  • @G.Y, 2433280 would mean that all 8 chunks had been downloaded in full (as was assumed), and that assumption may be incorrect. – Olter Feb 13 '13 at 10:33
  • Are you flushing your buffers when the download is paused/aborted or whatever causes it to require resuming at some later point? It sounds to me that some small part of the last chunk is not written to the file; this part may be left over in the StreamWriter buffer. – Jeff-Meadows Feb 13 '13 at 19:21
  • If you understand the problem best, then you should answer it and accept that answer! – Jeff-Meadows Feb 14 '13 at 07:25
  • If the problem is solved, then mark appropriate answer as accepted. If there is no such answer then answer your own question, and mark that answer as accepted. – Dialecticus Feb 19 '13 at 16:37

4 Answers


I'm answering your update:

This is my understanding so far: the length actually written to the file is more (rounded up to the next 1 KiB) than the amount you actually wrote to it. This causes your assumption "file.Length == amount downloaded" to be wrong.

One solution would be to track this information separately. Create some meta-data structure (which can be persisted using the same storage mechanism) to accurately track which blocks have been downloaded, as well as the entire size of the file:

[DataContract] //< I forgot how serialization on the phone works, please forgive me if the tags differ
struct Metadata
{
     [DataMember]
     public int Length;
     [DataMember]
     public int NumBlocksDownloaded;
}

This would be enough to reconstruct which blocks have been downloaded and which have not, assuming you keep downloading them consecutively.

edit

Of course, you would have to change your code from a simple append to seeking to the correct block position before writing the data to the stream:

 file.Position = currentBlock * delta;
 file.Write(block, 0, block.Length);
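
To survive app restarts, the metadata could be serialized to isolated storage as well. A sketch under the assumption that DataContractSerializer is available on the phone; the file name "download.meta" is made up:

```csharp
// Hypothetical persistence for the Metadata struct above.
// Load it on startup to know exactly how many blocks to skip on resume.
static void SaveMetadata(Metadata meta, IsolatedStorageFile store)
{
    using (IsolatedStorageFileStream stream =
           store.OpenFile("download.meta", FileMode.Create, FileAccess.Write))
    {
        new DataContractSerializer(typeof(Metadata)).WriteObject(stream, meta);
    }
}

static Metadata LoadMetadata(IsolatedStorageFile store)
{
    using (IsolatedStorageFileStream stream =
           store.OpenFile("download.meta", FileMode.Open, FileAccess.Read))
    {
        return (Metadata)new DataContractSerializer(typeof(Metadata)).ReadObject(stream);
    }
}
```

Updating the metadata after each successfully written block keeps the block counter independent of whatever file.Length reports.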
Simon
  • If you look at the numbers, you can see that the size actually written to the file is **less** than it should be (and you're saying the opposite). I suppose that means the chunk wasn't fully saved. Maybe I'm wrong, but I've checked the solution I've just written, and it seems the last chunk really isn't saved completely. – Olter Feb 13 '13 at 12:07

Continuing my comment...

The original file size, as I understand from your description, is 2432000 bytes.
The chunk size is set to 304160 bytes (one "delta").

So, the machine which sent the file was able to fill 7 chunks and send them.
The receiving machine now has 7 x 304160 bytes = 2129120 bytes.

The last chunk will not be filled to the end, as there are not enough bytes left to fill it, so it will contain: 2432000 - 2129120 = 302880, which is less than 304160.

If you add the numbers, you will get 7 × 304160 + 1 × 302880 = 2432000 bytes. So according to that, the original file was transferred in full to the destination.

The problem is that you are calculating 8 × 304160 = 2433280, insisting that even the last chunk must be filled completely. But with what? And why?

In all humbleness: are you locked in some kind of math confusion, or did I misunderstand your problem?
Please answer: what is the original file size, and what size is received at the other end? (Totals!)

G.Y
  • Actually, you misunderstood the explanation a bit. The primary point is that I was working on the "resume download" feature, so the total length of the file is more than 2432000. In the example it was about **4.5 MB**. So when I talked about 8 chunks, I knew that the file shouldn't be fully downloaded yet (the downloaded part is about 2 MB and the whole file is 4.5). That's why I supposed the downloaded part should be (delta × 8) in size. P.S. Yeah, I've read my explanation once again, and it does say "I know whether the file has been completely downloaded or not", but I didn't mention the real size. – Olter Feb 14 '13 at 10:50

Just as a possible bug: don't forget to verify that the file wasn't modified between requests, especially when a long time passes between them, which can happen with pause/resume. The error could be big: the file could be truncated to a smaller size, making your count wrong, or it could stay the same size but with modified contents, which would leave you with a corrupted file.

Diego C Nascimento

Have you heard the anecdote about a noob programmer and ten guru programmers? The guru programmers were trying to find an error in his solution, and the noob had already found it, but didn't tell anyone about it, because it was something so stupid that he was afraid of being laughed at.

Why did I remember this? Because the situation is similar.

The explanation in my question was already very long, and I decided not to mention some small aspects that I was sure worked correctly. (And they really did work correctly.)

One of these small aspects was the fact that the downloaded file was encrypted with AES using PKCS7 padding. Well, the decryption worked correctly, I knew it, so why mention it? And I didn't.

So then I tried to find out what exactly causes the error with the last chunk. The most plausible theory was a buffering problem, and I tried to find where I was losing the missing bytes. I tested again and again, but couldn't find them, as every chunk was being saved without any losses. And one day I understood:

There is no spoon

There is no error.

What's the effect of AES with PKCS7? Well, the relevant one is that the decrypted data is smaller than the encrypted data. Not by much, only 16 bytes per encrypted part. And this was accounted for in my decryption method and download method, so there should be no problem, right?

But what happens when the download process is interrupted? The last chunk is saved correctly; there are no buffering errors or anything else. Then we want to resume the download. The number of downloaded chunks is computed as currentFileChunkIterator = currentFileSize / delta;

And here I should ask myself: "Why are you trying to do something THAT stupid?"

"The size of one downloaded chunk is not delta. Actually, it's less than delta." (The decryption makes each chunk smaller by 16 bytes per part, remember?)

The delta itself consists of 10 equal parts, each decrypted separately. So we should divide not by delta, but by (delta - 16 * 10), which is 304160 - 160 = 304000.

I smell a rat here. Let's try to find the number of downloaded chunks:

2432000 / 304000 = 8. Wait... OH SHI~
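
In code, the fixed resume calculation looks like this (a sketch using the constants from the numbers above):

```csharp
// Each saved chunk is the *decrypted* size: delta minus 16 bytes of
// PKCS7 padding for each of the 10 encrypted parts inside it.
const int delta = 304160;
const int partsPerChunk = 10;
const int paddingPerPart = 16;
const int decryptedDelta = delta - partsPerChunk * paddingPerPart; // 304000

long currentFileSize = 2432000;                           // from fileStream.Length
long chunksDownloaded = currentFileSize / decryptedDelta; // 8, as expected
```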


So, that's the end of the story.

The whole solution logic was right.

The only reason it failed was my assumption that, for some reason, the downloaded decrypted file size should be the same as the sum of the downloaded encrypted chunks.

And of course, since I didn't mention the decryption (it's mentioned only in the previous question, which is only linked), none of you could give me the correct answer. I'm terribly sorry about that.

Olter