
I have a program that creates an object for each file in a directory (sub)tree. In these days of larger and larger disks, there is no way to know how many files that will be, esp. not a few years (months?) from now.

My program is not enterprise-critical; it is a tool for a user to analyze that subtree. So it is acceptable to tell the user that there is not enough memory in this environment to operate on that subtree. He could possibly do what he wants by choosing subtrees of that subtree.

But it is not acceptable for the program to just die, or throw a stacktrace, or other things only a programmer can love. I would like the program to give the user some reasonable feedback and let him control what he does about it.

I have read a number of the posts here on StackOverflow about OOM exceptions, and in the main I agree with a number of points: badly designed apps, memory leaks, etc., are all problems that need to be thought of. But in this case, somebody might attempt to use my tool on a 10T disk that simply has more files than the program is prepared to analyze. And I'm not trying to write the tool so that it operates on every possible subtree.

I have seen suggestions that OOM can just be caught "like any other exception"; unfortunately, this is not a robust way to do things. By the time the OOM is thrown, some thread is likely to have died already, we cannot tell in advance which one it will be, and we can't restart it. So if it happens to be one critical to Swing, for instance, we are out of luck.

So my current thinking is that my program will need to take at least occasional looks at the amount of free memory available and stop itself if that drops below some threshold. I can test to determine a threshold that still leaves me enough room to display a dialog box with a message and then drop all my references to the objects I've built.
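
For concreteness, here is a minimal sketch of the kind of check I have in mind (the MemoryGuard name and the 10% threshold are just placeholders, not anything I've settled on):

```java
// Minimal sketch of a periodic memory check. It measures headroom against the
// -Xmx ceiling rather than trusting freeMemory() alone, since freeMemory()
// only reports slack in the currently committed heap.
public final class MemoryGuard {

    private static final double STOP_THRESHOLD = 0.10; // stop below 10% headroom

    /** Returns true if less than 10% of the maximum heap is still obtainable. */
    public static boolean lowOnMemory() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory(); // bytes actually in use
        long max = rt.maxMemory();                      // the -Xmx ceiling
        long headroom = max - used;                     // what could still be allocated
        return headroom < (long) (max * STOP_THRESHOLD);
    }
}
```

The idea would be to call something like this every few hundred files and, when it returns true, stop scanning, drop the references, and show the dialog.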

But if I'm missing something, or there's a better way to go about things, I'd like to know it.

arcy
  • I would have personally used something like Lucene and indexed the file system to disk; there's no reason I can see that it should be in memory. Alternatively, if you do need a huge memory-based index, that's what you need and you need to allocate heap accordingly – aishwarya Dec 18 '11 at 04:17
  • @aishwarya: I thought the same thing when reading the question. +1 – DaveFar Dec 18 '11 at 08:35
  • The problem with checking the free memory is that it only shows the known free memory, not how much would be freed should GC run. This means you could potentially stop processing on low memory when a lot of memory is available as soon as GC runs. Would that be acceptable? – Roger Lindsjö Dec 18 '11 at 10:21
  • The better way of doing things is to ensure you are using a limited amount of memory regardless of the amount of work you need to do. There is no reason IMHO why analysing a 10 TB disk should use more memory than analysing a 1 TB disk. Once you solve the problem of ensuring your memory doesn't increase significantly, you don't have this issue. – Peter Lawrey Dec 18 '11 at 10:25
  • @aishwarya putting things on disk (through another tool or my own code) is something to consider, though of course one expects that to be very much slower than in-memory, and it is a user tool where response time is an issue. I could declare the program limited to avail memory (leaving me with the original problem), or use disk and declare response time less important. If I want to get sophisticated I could have it switch when memory gets low (again leaving me with part of the original problem). – arcy Dec 18 '11 at 11:25
  • @rcook, given the nature of the problem, I would think the compromise between response time and the ability to work in most conditions is an interesting one :-) Lucene is very optimised for performance, and there's not much of a difference you would see compared with in-memory storage. Plus it has in-memory support too, much more optimised than I can casually write :) Eagerly fetching memory and defining application behaviour – that's the GC's job, and it's not going to be easy to replicate. – aishwarya Dec 19 '11 at 04:33

2 Answers


See my post here. Why not calculate freeMemory as you're populating the tree, and stop at some (possibly user-configurable) point, like 90% of the heap occupied? You should really try to keep the object you create for each file as small as possible. Can you paste the code for this data structure so we can critique it and see if it can be made smaller? Maybe you don't need to hold the data in the object directly, but rather use a proxy object that can get the relevant information upon request.
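
For example, a rough sketch of such a proxy (the class name is hypothetical, since your actual data structure isn't posted) that keeps only the Path in memory and reads attributes on demand:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Rough sketch of a per-file proxy: the only state held in memory is the Path;
// everything else is read from the file system when it is actually needed.
public final class FileNodeProxy {

    private final Path path;

    public FileNodeProxy(Path path) {
        this.path = path;
    }

    /** Reads the size lazily instead of caching it per file. */
    public long size() throws IOException {
        BasicFileAttributes attrs =
                Files.readAttributes(path, BasicFileAttributes.class);
        return attrs.size();
    }

    public Path path() {
        return path;
    }
}
```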

Amir Afghani

This is not a direct answer, just my 2 cents on the approach I would take instead:

As a user of such a program (or api), I'd be happy to get feedback / estimates / control as early as possible, for instance:

  • get a warning such as "There are 42 fantastillion files in the given directory, which will require about 5 hours of processing and 2 GB of memory"
  • being able to set the maximum number of files to be processed (maybe with a filter to select the most relevant files), or the maximum recursion depth of traversed subtrees
  • having the program do a breadth-first search (instead of a depth-first search) through the directory structure and, when a directory is too large, giving the user the option to abort the analysis altogether, ignore just the current directory, or ignore the current subtree (see the sketch after this list).
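
As a minimal sketch of that last point (class and method names are hypothetical, and the skip/abort decision is reduced to a simple size check):

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch of a breadth-first directory walk with a depth limit.
// In a real tool, the size check below would pop up a dialog offering
// "abort", "skip this directory" or "skip this subtree".
public final class BreadthFirstScan {

    public static void scan(File root, int maxDepth) {
        Queue<File> queue = new ArrayDeque<File>();
        Queue<Integer> depths = new ArrayDeque<Integer>();
        queue.add(root);
        depths.add(0);

        while (!queue.isEmpty()) {
            File dir = queue.remove();
            int depth = depths.remove();

            File[] children = dir.listFiles();
            if (children == null) {
                continue; // unreadable directory: ignore it rather than fail
            }
            if (children.length > 100000) {
                continue; // placeholder for asking the user what to do
            }
            for (File child : children) {
                if (child.isDirectory() && depth < maxDepth) {
                    queue.add(child);
                    depths.add(depth + 1);
                } else if (child.isFile()) {
                    process(child);
                }
            }
        }
    }

    private static void process(File file) {
        // placeholder for building the per-file object
    }
}
```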

I would also find it acceptable to wait some time for preprocessing if I got the appropriate feedback, like "Preprocessing: traversing subdirectories to estimate required time and memory", followed by good estimates.

With reasonable estimation, I don't think you would need sophisticated memory-usage monitoring. If OOM exceptions still occur more than rarely, I would rather follow aishwarya's approach and write to disk instead of holding all the information in memory.
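
As a bare-bones illustration of spilling per-file records to disk with plain JDK I/O (the class name and record layout are just placeholders; Lucene, as aishwarya suggests, would give you indexed, searchable storage instead of a flat file):

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Appends one small record per file to a spill file on disk, so nothing
// about the scanned files has to be retained in memory.
public final class DiskSpillWriter implements Closeable {

    private final DataOutputStream out;

    public DiskSpillWriter(File spillFile) throws IOException {
        this.out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile)));
    }

    /** Writes the path and length of one file; the record layout is arbitrary. */
    public void write(File file) throws IOException {
        out.writeUTF(file.getAbsolutePath());
        out.writeLong(file.length());
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```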

DaveFar