Copying a part of a byte[] array into a PDFReader

Question

This is a continuation of my ongoing struggle to reduce my memory load, mentioned in How do you refill a byte array using SqlDataReader?

So I have a byte array of a fixed size; for this example, I'll say new byte[400000]. Into this array I'll be placing PDFs of different sizes (all smaller than 400000 bytes).

Pseudocode would be:

public void Run()
{
    byte[] fileRetrievedFromDatabase = new byte[400000];
    foreach (var document in documentArray)
    {
        // Refill the file with data from the database
        var currentDocumentSize = PopulateFileWithPDFDataFromDatabase(fileRetrievedFromDatabase);

        var reader = new iTextSharp.text.pdf.PdfReader(fileRetrievedFromDatabase.Take((int)currentDocumentSize ).ToArray());
        pageCount = reader.NumberOfPages;
        // DO ADDITIONAL WORK
    } 
}

private int PopulateFileWithPDFDataFromDatabase(byte[] fileRetrievedFromDatabase)
{
    // DataAccessCode Goes here
    int documentSize = 0;
    int bufferSize = 100;                   // Size of the BLOB buffer.
    byte[] outbyte = new byte[bufferSize];  // The BLOB byte[] buffer to be filled by GetBytes.

    myReader = logoCMD.ExecuteReader(CommandBehavior.SequentialAccess);

    Array.Clear(fileRetrievedFromDatabase, 0, fileRetrievedFromDatabase.Length);

    if (myReader == null)
    {
        return 0;
    }

    while (myReader.Read())
    {
        // A null buffer makes GetBytes return the total length of the field.
        documentSize = (int)myReader.GetBytes(0, 0, null, 0, 0);

        // Reset the starting byte for the new BLOB.
        startIndex = 0;

        // Read the bytes into outbyte[] and retain the number of bytes returned.
        retval = myReader.GetBytes(0, startIndex, outbyte, 0, bufferSize);

        // Continue reading and writing while there are bytes beyond the size of the buffer.
        while (retval == bufferSize)
        {
            Array.Copy(outbyte, 0, fileRetrievedFromDatabase, startIndex, retval);

            // Reposition the start index to the end of the last buffer and fill the buffer.
            startIndex += retval;
            retval = myReader.GetBytes(0, startIndex, outbyte, 0, bufferSize);
        }
    }

    return documentSize;
}

The problem with the above code is that I keep getting a "Rebuild trailer not found. Original Error: PDF startxref not found" error when I try to access the PDF Reader. I believe it's because the byte array is too long and has trailing zeros. But since I'm reusing the byte array so that I'm not continuously building new objects on the LOH, I need to do it this way.

So how do I get just the piece of the Array that I need and send it to the PDFReader?

Updated

So I looked at the source and realized I had some variables from my actual code that were confusing. I'm basically reusing the fileRetrievedFromDatabase object in each iteration of the loop. Since it's passed by reference, it gets cleared (set to all zeros) and then filled in PopulateFileWithPDFDataFromDatabase. This object is then used to create a new PDF.

If I didn't do it this way, a new large byte array would be created in every iteration, the Large Object Heap would fill up, and eventually an OutOfMemory exception would be thrown.

Kiril · Answer 1 · 2012-02-02T20:46:58.917

1

You have at least two options:

Treat your buffer like a circular buffer with two indexes for the starting and ending positions. You need an index of the last byte written in outbyte, and you have to stop reading when you reach that index.
Simply read the same number of bytes as you have in your data array to avoid reading into the "unknown" parts of the buffer which don't belong to the same file.

In other words, instead of having bufferSize as the last parameter, have the data.Length.

// Read the bytes into outbyte[] and retain the number of bytes returned.
retval = myReader.GetBytes(0, startIndex, outbyte, 0, data.Length);

If data length is 10 and your outbyte buffer is 15, then you should only read the data.Length not the bufferSize.

However, I still don't see how you're reusing the outbyte "buffer", if that's what you're doing... I'm simply not following based on what you've provided in your question. Maybe you can clarify exactly what is being reused.

edited Feb 02 '12 at 20:46

answered Feb 02 '12 at 20:31

appreciate the quick response, but I'm not seeing where I read past the buffer size. – Cyfer13 Feb 02 '12 at 20:34
I tried to update my answer, but I guess you were quicker to comment than I was to update. I don't see where you actually put 'data' into 'outbyte', but when you do, you have to make sure that you only read up to the `data.Length` instead of the `bufferSize`, because based on your explanation it seems that `data.Length` should always be less than the `bufferSize`. Furthermore, since you're reusing the buffer, files of different size may result in "junk" in the buffer from an older and larger file and that may "corrupt" your file. – Kiril Feb 02 '12 at 20:42
so for the line that you're mentioning, I don't think there is a way to get the length of the data, since that is why we're reading it in the first place and then returning how much data was actually read and populated. I know for a fact that there are going to be trailing zero's each time the byte[] is filled because all pdf's will be smaller than the largest size. Largest size is determined by the largest pdf * 1.1. – Cyfer13 Feb 02 '12 at 20:44
according to that logic, the only one that should have garbage data is the last. (I'm updating the code to add the buffer size). And I handle that by only copying in retval worth of bytes as opposed to the buffersize. – Cyfer13 Feb 02 '12 at 20:47
It's still not clear to me which buffer you're reusing, but anyway, you seem to be on the right track. There will be garbage every time you read a smaller file than what you read last. Say you read a file with 150K, then you read one with 120K, you'll have 30K of garbage. Suppose then you read another one that's 140K, then you'll still have 10K of garbage and so on. – Kiril Feb 02 '12 at 20:57
Exactly. I clear the array back to all zeros, but even those are garbage. That is why when I create the PDFReader, I don't just send it the full array, but I try to only take part of the array. But apparently, I am doing something wrong since it's not working. – Cyfer13 Feb 02 '12 at 21:04
The outbyte buffer isn't being reused outside of the while loop. It's being used to fill the fileRetrievedFromDatabase array. The fileRetrievedFromDatabase is the object being reused in every iteration. – Cyfer13 Feb 02 '12 at 21:29

score 1 · Accepted Answer · answered Feb 02 '12 at 22:03

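Apparently, the way the while loop is currently structured, it wasn't copying the data on its last iteration: the loop exits as soon as GetBytes returns fewer bytes than bufferSize, so the final, partially filled chunk was never copied. I needed to add this after the inner while loop: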

if (outbyte != null && outbyte.Length > 0 && retval > 0)
{
    Array.Copy(outbyte, 0, currentDocument.Data, startIndex, retval);
}

It's now working, although I will definitely need to refactor.
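One possible refactoring, offered only as a sketch: loop while the last read returned any bytes at all, so the final partial chunk is copied inside the loop and no extra `if` is needed afterwards. The `BlobCopy` class and its local `GetBytes` helper are hypothetical stand-ins (they are not part of the original code); the helper only imitates the shape of `myReader.GetBytes` over an in-memory array so the example runs without a database.

```csharp
using System;

static class BlobCopy
{
    // Hypothetical stand-in for myReader.GetBytes: copies up to `length` bytes
    // of `source`, starting at `fieldOffset`, into `buffer` and returns the count.
    static long GetBytes(byte[] source, long fieldOffset, byte[] buffer, int bufferOffset, int length)
    {
        long remaining = source.Length - fieldOffset;
        int count = (int)Math.Max(0, Math.Min(length, remaining));
        Array.Copy(source, fieldOffset, buffer, bufferOffset, count);
        return count;
    }

    // Fills `target` from `source` in chunks; returns the number of bytes written.
    public static int Fill(byte[] source, byte[] target, int bufferSize = 100)
    {
        byte[] outbyte = new byte[bufferSize];
        long startIndex = 0;
        long retval;

        // Continue while the last read returned any bytes, so the final,
        // partially filled chunk is copied as well.
        while ((retval = GetBytes(source, startIndex, outbyte, 0, bufferSize)) > 0)
        {
            Array.Copy(outbyte, 0, target, startIndex, retval);
            startIndex += retval;
        }
        return (int)startIndex;
    }

    static void Main()
    {
        byte[] source = new byte[250];   // two full chunks plus one partial chunk of 50
        for (int i = 0; i < source.Length; i++) source[i] = (byte)(i + 1);

        byte[] target = new byte[400];   // reusable, oversized buffer
        int written = Fill(source, target);

        Console.WriteLine(written);                       // 250
        Console.WriteLine(target[249] == source[249]);    // True: last chunk was copied
    }
}
```

With the original `while (retval == bufferSize)` condition, the same input would stop after 200 bytes, which matches the truncated PDF described in the question.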

Cyfer13


And a bonus is that if the byte array has zeros at the end, the PDFReader just ignores them. – Cyfer13 Feb 02 '12 at 22:11
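If the trailing zeros ever do need to be excluded, note that `Take(n).ToArray()` in the question allocates a new array on every iteration. A minimal sketch of an alternative is to wrap only the filled part of the reusable buffer in a read-only `MemoryStream`, which does not copy the bytes. The `BufferView` class and `ValidPrefix` method are hypothetical names, and this assumes the consumer (for example, a PDF reader overload) accepts a `Stream`; such a consumer may still copy the data internally.

```csharp
using System;
using System.IO;

static class BufferView
{
    // Wraps only the first `count` bytes of a reusable buffer in a read-only
    // stream. No bytes are copied; the stream reads directly from `buffer`.
    public static MemoryStream ValidPrefix(byte[] buffer, int count)
    {
        return new MemoryStream(buffer, 0, count, writable: false);
    }

    static void Main()
    {
        byte[] buffer = new byte[400000];   // reusable buffer, as in the question
        int documentSize = 5;               // hypothetical size of the current document
        for (int i = 0; i < documentSize; i++) buffer[i] = (byte)(i + 1);

        using (MemoryStream view = ValidPrefix(buffer, documentSize))
        {
            // The stream exposes only the filled prefix, not the trailing zeros.
            Console.WriteLine(view.Length);   // 5
        }
    }
}
```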
