
So this is in relation to dealing with the Large Object Heap and trying to minimize the number of times I instantiate a byte[]. Basically, I'm having OutOfMemoryExceptions and I feel like it's because we're instantiating too many byte arrays. The program works fine when we process a couple of files, but it needs to scale, and it currently can't.

In a nutshell, I've got a loop that pulls documents from a database. Currently it pulls one document at a time and then processes it. Documents can range from less than a meg to 400+ megs (hence why I'm processing one at a time). The following is pseudo-code, written before I've optimized.

So the steps I'm doing are:

  1. Make a call to the database to find the largest file size (and then multiply it by 1.1):

    // GetMaxFileSize() is assumed to return a nullable size (long?)
    var maxDataSize = new BiztalkBinariesData().GetMaxFileSize();
    maxDataSize = (maxDataSize != null && maxDataSize > 0)
        ? (long)(maxDataSize * 1.1)
        : 0;
    var FileToProcess = new byte[maxDataSize.Value];
    
  2. Then I make another database call pulling all of the documents (without their data) from the database and place them into an IEnumerable:

    UnprocessedDocuments =
        claimDocumentData.Select(StatusCodes.CurrentStatus.WaitingToBeProcessed);
    foreach (var currentDocument in UnprocessedDocuments)
    {
        // all of the following code goes here
    }
    
  3. Then I populate my byte[] array from an external source:

    FileToProcess = new BiztalkBinariesData()
        .Get(currentDocument.SubmissionSetId, currentDocument.FullFileName);
    
  4. Here is the question. It would be much cleaner to pass the currentDocument (IClaimDocument) to other methods to process. So if I set the data part of the currentDocument to the pre-formatted array, will this use the existing reference? Or does this create a new array in the Large Object Heap?

    currentDocument.Data = FileToProcess;
    
  5. At the end of the loop, I would then clear FileToProcess:

    Array.Clear(FileToProcess, 0, FileToProcess.Length);
    

Was that clear? If not, I'll try to clean it up.

    Don't use feelings to figure out culprits. Use a memory profiler. – Oded Jan 31 '12 at 15:47
  • Does `UnprocessedDocuments` or its items implement `IDisposable`? – sll Jan 31 '12 at 15:52
  • *Where* are you getting the `OutOfMemoryException`? Is there a relatively consistent place where this exception gets thrown, or is it seemingly random? – Branko Dimitrijevic Jan 31 '12 at 15:55
  • UnprocessedDocuments is IEnumerable<IClaimDocument> (IClaimDocument is a data object: strings, ints, etc.). – Cyfer13 Jan 31 '12 at 15:55
  • As for a memory profiler, I've never used one before and think it's a great idea. I downloaded a couple the other day, but was really overwhelmed. Any advice for a beginner? – Cyfer13 Jan 31 '12 at 15:56
  • Before using a memory profiler I would check whether fragmentation of the LOH is the issue here. See http://msdn.microsoft.com/en-us/magazine/cc163528.aspx#S8 – Jf Beaulac Jan 31 '12 at 16:00
  • Definitely an area I need to learn more about. That article was a challenge for me to read. I understand it's an important topic to know, but I feel like I need a better intro to fragmentation. – Cyfer13 Jan 31 '12 at 16:03
  • At what point do you get rid of the data contained within FileToProcess before you initialize it to a byte array of a different size? You seem to think that the process of creating a byte array is the problem, when it could be that you are dealing with large files and basically putting them in memory. I just noticed that BrokenGlass makes mention of this. All you have to do is create a byte array once that is large enough for the largest supported file. What you have now does not get rid of the old byte array right away; while it's not being used by your process, it's unlikely to be usable by another. – Security Hound Jan 31 '12 at 16:15
  • @Cyfer13 - I wanted to mention that this is the best pseudo-code I have seen in years. Your question is very clear, you provided enough information to understand what you understand, and provided pseudo-code that, while not perfect, explains the problem you're running into. If I could give you a second upvote I would. – Security Hound Jan 31 '12 at 16:19
  • Step 1 is where I want to create the initial array of the largest size. But it looks like I'm replacing the reference from that original file with a new reference, which is basically making that initial step useless. – Cyfer13 Jan 31 '12 at 16:24

5 Answers


Step 1:

var FileToProcess = new byte[maxDataSize];

Step 3:

FileToProcess = new BiztalkBinariesData()
    .Get(currentDocument.SubmissionSetId, currentDocument.FullFileName);

Your step 1 is completely unnecessary, since you re-assign the array in step 3 - you are creating a new array there, not populating the existing one. So essentially step 1 just creates more work for the GC, which, if you do it in quick order (and if it is not optimized away by the compiler, which is entirely possible), might explain some of the memory pressure you are seeing.
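
To make the distinction concrete, here is a minimal, self-contained sketch (names are illustrative, not from the question) of the difference between re-assigning an array variable and filling an existing array in place:

using System;

class ReassignVsFill
{
    static void Main()
    {
        var buffer = new byte[4];                  // the "step 1" allocation
        var fromDb = new byte[] { 1, 2, 3, 4 };    // stands in for what the DB call returns

        // Re-assignment: buffer now points at fromDb; the original
        // 4-byte array is unreferenced and waits for the GC.
        buffer = fromDb;
        Console.WriteLine(ReferenceEquals(buffer, fromDb));    // True

        // In-place fill: copies bytes into the existing array,
        // allocating nothing new - this is what step 1 assumed.
        var reusable = new byte[4];
        Buffer.BlockCopy(fromDb, 0, reusable, 0, fromDb.Length);
        Console.WriteLine(ReferenceEquals(reusable, fromDb));  // False
    }
}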

BrokenGlass
  • So according to this, no matter what I do, my BiztalkBinariesData().Get() method will be creating a new byte[]? My code currently uses this: object retValue = MedicareDatabase.Db.ExecuteScalar(storedProcedure); if (retValue == DBNull.Value || retValue == null) { ret = default(T); } else { ret = retValue.ConvertTo(); } – Cyfer13 Jan 31 '12 at 15:58
  • Was just about to post this myself. In addition, the line "Array.Clear(FileToProcess, 0, FileToProcess.Length);" doesn't clear up anything. It doesn't deallocate any memory, it just sets the values to 0 (but still allocated). It's only ever doing quite a bit of useless work. Delete this line. – Servy Jan 31 '12 at 15:59
  • @Cyfer13 Yes, it is creating a new byte array each time, which is where the problems are coming from. If you were re-using the same array created at the start it wouldn't ever take up more than that amount of memory. The array itself is allocated in the database's execute method. – Servy Jan 31 '12 at 16:02
  • So replace object retValue = ... with my FileToProcess = (byte[])MedicareDatabase.Db.ExecuteScalar(...)? – Cyfer13 Jan 31 '12 at 16:05
  • No, adding a cast in there wouldn't change anything. I doubt there is any way for you to alter the method in such a way as to be able to re-use the array. That's why we've been saying that creating a new array is pointless; you can't stop the database adapter from creating its own arrays. When you say 'FileToProcess = ...' you aren't populating the FileToProcess array with the data in the array on the right. FileToProcess just points to an array, and you're changing the array that it points to from whatever it was before to the new array returned by the Execute method. (...) – Servy Jan 31 '12 at 16:12
  • @Cyfer13: No, that won't change anything - the only way to re-use the same array is to pass the array to the method interacting with the database and then using some low level call to populate it, i.e. `SqlDataReader.GetBytes()` – BrokenGlass Jan 31 '12 at 16:13
  • Hmm... this puts me in a predicament. I think that is why I was going to try and repopulate the same array over and over again. So instead of changing references, I'd be using the same reference, just with different data. It looks like I might be able to reduce the number of arrays created, but I won't be able to limit them to just a few; I'll definitely have one per database call. – Cyfer13 Jan 31 '12 at 16:15
  • (Continued) Each time you assign a new array to FileToProcess the previous array is just getting thrown out (since you're not holding onto it elsewhere). When you assign FileToProcess with the first file you are throwing out the array you create at the start. When you assign the second file you are throwing out the array from the first file. Once the array is no longer referenced, the garbage collector will eventually come along and free up that memory, but that only happens every so often. Since you're 'throwing out' a lot of large arrays continuously, it's eating up a lot of memory. – Servy Jan 31 '12 at 16:16
  • @Servy You've hit the nail on the head in a much clearer way than I could say it. With that explanation, I guess I need to find a way to refill the array with new data instead of creating a new reference. Is this possible? – Cyfer13 Jan 31 '12 at 16:27
  • Would the MSDN example in this post be what I'm looking for to refill the byte array? If I populated it with the maximum size and then passed it to this method (removing the new byte[]), then it might work: http://stackoverflow.com/questions/5371222/getting-binary-data-using-sqldatareader – Cyfer13 Jan 31 '12 at 16:51
  • While I haven't worked with SqlDataReader much, it looks like it's doing what you want it to be doing here, yes. You should also follow the model there of filling a (rather small) buffer and looping until you've gone through the whole file, rather than trying to get the bytes for the whole file all at once. That will significantly shrink the memory footprint of your program by not needing to allocate a 500MB byte array. – Servy Jan 31 '12 at 17:04
  • I'd still need to allocate the 500MB byte array, as I would need to pass it into the database call... correct? And since it's passed by reference, I don't really need to return anything from the database call. Basically wrapping the code in that method with: private void DatabaseCall(byte[] fileToProcess). Then step 4 I would still do, because this sets the reference on my object, and then I can pass the object around. Someone mentioned not to do step 5, but if I'm refilling an array, wouldn't it be better if the array was 'empty' before I refilled it? – Cyfer13 Jan 31 '12 at 17:11
  • So one thing that you've left out until now (rightfully so) is what you're doing with the data once you have it in a byte array. The primary question here is, "Can it be processed in chunks, or do you need it all in one big array?" We can help answer that if you don't know. Being able to use chunks would be best. The idea is that you would read, say, 1MB of the file, process it, read the next 1MB, process it, and when done with that file, read 1MB of the next file, and so on (re-using the same 1MB byte array). The link that you provided shows how to grab chunks of data. (...) – Servy Jan 31 '12 at 17:37
  • (Continued) If you can't use chunks and you need one big byte array, then you still don't need to clear it out. In addition to the array, you should hold onto the actual length of the file. With that, you know that everything after that index is garbage (i.e. leftovers from a previous file) and you shouldn't be using it. This also means you won't be processing a *bunch* of empty bytes at the end of each file. – Servy Jan 31 '12 at 17:40
  • Can't use the chunks. Basically it's a PDF that we're splitting into smaller PDFs based on the size of the pages within. So if we have the length of the file, then I would use currentDocument.Data.Take(fileLength). So in the case of sending it to iTextSharp, it would be reader = new PdfReader(currentDocument.Data.Take(fileLength)). – Cyfer13 Jan 31 '12 at 18:13
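
For readers who can use chunking, here is a minimal sketch of the SqlDataReader.GetBytes approach discussed above - the open connection, the "GetDocumentData" stored procedure, the column index, and the ProcessChunk handler are all assumptions for illustration:

using System.Data;
using System.Data.SqlClient;

// Streams the BLOB through one small reusable buffer, so no
// per-document large array is ever allocated on the LOH.
using (var cmd = new SqlCommand("GetDocumentData", connection))
{
    cmd.CommandType = CommandType.StoredProcedure;
    using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        var chunk = new byte[64 * 1024];   // well under the ~85,000-byte LOH threshold
        while (reader.Read())
        {
            long offset = 0;
            long read;
            // Column 0 is assumed to hold the document bytes.
            while ((read = reader.GetBytes(0, offset, chunk, 0, chunk.Length)) > 0)
            {
                ProcessChunk(chunk, (int)read);   // hypothetical per-chunk handler
                offset += read;
            }
        }
    }
}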

Arrays are reference types and as such you will be passing a copy of the reference, not a copy of the array itself. That would only be true with value types.

This simple snippet illustrates how arrays behave as reference types:

public void Test()
{    
    var intArray = new[] {1, 2, 3, 4};
    EditArray(intArray);
    Console.WriteLine(intArray[0].ToString()); //output will be 0
}

public void EditArray(int[] intArray)
{
    intArray[0] = 0;
}
InBetween
  • What you describe is only true if the **ref** isn't used. It's not clear that this isn't the case; the sample code only attempts to explain the problem - best pseudo-code I have seen in years. – Security Hound Jan 31 '12 at 16:18
  • @Ramhound: `ref` does not change the fact that arrays are passed by reference. The only difference is whether the reference itself is passed as a copy or not. The behavior I describe in this post is true with or without `ref`. – InBetween Jan 31 '12 at 16:27
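
To illustrate the `ref` point in the same style as the snippet above (both methods are hypothetical):

public void ReplaceArray(int[] intArray)
{
    // Re-binds only the local parameter; the caller's variable
    // still points at the original array.
    intArray = new[] { 9, 9, 9 };
}

public void ReplaceArrayRef(ref int[] intArray)
{
    // With ref, the caller's variable itself is re-bound
    // to the newly allocated array.
    intArray = new[] { 9, 9, 9 };
}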

It will use the existing reference, don't worry. The array's contents aren't copied.

Ry-

Your problem could be in the implementation and usage of the "BiztalkBinariesData" class.

I'm not sure how it is implemented, but I do see that you're creating a new instance each time:

new BiztalkBinariesData()

Something to think about.
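
For illustration, a minimal sketch of hoisting the instance out of the loop (assuming `BiztalkBinariesData` is safe to reuse across calls):

// Create the data-access object once, outside the loop,
// instead of newing it up for every document.
var binariesData = new BiztalkBinariesData();
foreach (var currentDocument in UnprocessedDocuments)
{
    currentDocument.Data = binariesData
        .Get(currentDocument.SubmissionSetId, currentDocument.FullFileName);
}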

mikey
  • This was pseudo-code. I normally instantiate the object once, and then call Get() in the loop. – Cyfer13 Jan 31 '12 at 16:01
  • As was pointed out to me in a different post, you are correct. The Get() method is currently instantiating a new object instead of using the pre-existing reference. – Cyfer13 Jan 31 '12 at 16:10

I'm having OutOfMemoryExceptions and I feel like it's because we're instantiating too many byte arrays

No, it is because you allocate LARGE arrays. Limit them to something like 48 KB or 64 KB and "combine" them with a custom container. With 64 KB chunks you can take the upper two bytes of a 32-bit index to determine which array to use; the container holds an array of arrays. Handling very large objects leads to fragmentation and, later on, the inability to allocate one large contiguous array.
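
A minimal sketch of such a container, assuming 64 KB chunks so the high bits of a flat index select the chunk and the low 16 bits the offset within it (all names are illustrative):

// Every allocation stays at 64 KB - far below the ~85,000-byte
// LOH threshold - while callers still see one flat byte index.
public sealed class ChunkedBytes
{
    private const int ChunkSize = 64 * 1024;   // 2^16 bytes per chunk
    private readonly byte[][] _chunks;

    public ChunkedBytes(long length)
    {
        Length = length;
        _chunks = new byte[(length + ChunkSize - 1) / ChunkSize][];
        for (long i = 0; i < _chunks.Length; i++)
            _chunks[i] = new byte[ChunkSize];
    }

    public long Length { get; private set; }

    public byte this[long index]
    {
        // High bits pick the chunk, low 16 bits the offset within it.
        get { return _chunks[index >> 16][index & 0xFFFF]; }
        set { _chunks[index >> 16][index & 0xFFFF] = value; }
    }
}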

TomTom