
I would like to process a text file (about 400 MB) in order to create a recursive parent-child structure from the data given in each line. The data has to be prepared for top-down navigation (input: a parent; output: all of its children and sub-children). Example lines to be read (fields: child;id1;id2;parent;id3):

132142086;1;2;132528589;132528599
132142087;1;3;132528589;132528599
132142088;1;0;132528589;132528599
323442444;1;0;132142088;132528599
454345434;1;0;323442444;132528599

132528589 is the parent of 132142086, 132142087, 132142088
132142088 is the parent of 323442444
323442444 is the parent of 454345434

Given: Windows XP, 32-bit, 2 GB of available memory, and -Xmx1024m. Here is the way I prepare the data:

HashMap<String, ArrayList<String>> hMap = new HashMap<String, ArrayList<String>>();
while ((myReader = bReader.readLine()) != null) {
    String[] tmpObj = myReader.split(delimiter);
    String valuesArrayS = tmpObj[0] + ";" + tmpObj[1] + ";" + tmpObj[2] + ";" + tmpObj[3] + ";" + tmpObj[4];
    ArrayList<String> valuesArray = new ArrayList<String>();
    // if the key already exists, start from a copy of its current list
    if (hMap.containsKey(tmpObj[3])) {
        valuesArray = (ArrayList<String>) (hMap.get(tmpObj[3])).clone();
    }
    valuesArray.add(valuesArrayS);
    hMap.put(tmpObj[3], valuesArray);
    tmpObj = null;
    valuesArray = null;
}

return hMap;

Then I use a recursive function:

HashMap<String, ArrayList<String>> getChildren(String parent)

to create the data structure needed. The plan is to make hMap available (read-only) to more than one thread via getChildren; a sketch of the shape of such a function follows.
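(For context, a minimal sketch of what such a recursive function might look like; the actual getChildren is not shown above, so the names and details below are assumptions:)

static HashMap<String, ArrayList<String>> getChildrenSketch(
        String parent, HashMap<String, ArrayList<String>> hMap) {
    HashMap<String, ArrayList<String>> result = new HashMap<String, ArrayList<String>>();
    ArrayList<String> children = hMap.get(parent);
    if (children == null) {
        return result;                       // leaf: no entry means no children
    }
    result.put(parent, children);
    for (String line : children) {
        String childId = line.split(";")[0]; // first field of each stored line is the child id
        result.putAll(getChildrenSketch(childId, hMap));
    }
    return result;
    // note: no guard against cycles; a parent that is its own descendant
    // (mentioned in the comments below) would recurse forever
}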
I tested this program with an input file of 90 MB and it seemed to work properly. However, running it with the real file of more than 380 MB led to:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I need some help with memory resource management.

user1256982
  • possible duplicate: http://stackoverflow.com/questions/1565388/increase-heap-size-in-java – assylias Mar 08 '12 at 14:26
  • Please list the Reader implementation you are using. Looks like a BufferedReader? – nsfyn55 Mar 08 '12 at 14:27
  • BufferedReader bReader = new BufferedReader(new FileReader(fileName)); – user1256982 Mar 08 '12 at 14:33
  • @assylias: Hi, I would like to avoid reserving more than 1 GB of memory, and I'm not sure it would be enough even with 2 GB (the maximum available). – user1256982 Mar 08 '12 at 14:38
  • @assylias I am pretty sure that XP is going to shut down any meaningful attempts to increase Max heap on a 32 bit box "On most modern 32-bit Windows systems the maximum heap size will range from 1.4G to 1.6G. " – nsfyn55 Mar 08 '12 at 14:42
  • @nsfyn55 I read too quickly - had not seen he was already allocating 1GB – assylias Mar 08 '12 at 14:44
  • Something like this should really be done with a database or some other sort of meaningful storage, not with a collection, especially if you have memory constraints. – Viruzzo Mar 08 '12 at 14:49
  • The collection will be vastly faster __if__ you can do it at all and _if_ you do it right. It's worth working very hard to work in-memory rather than resorting to a DB. – RalphChapin Mar 08 '12 at 15:22

4 Answers


From the "dirt-simple approach" side of things: Based on your problem statement, you don't need to keep id1, id2, or id3 around. Assuming that's the case, how about replacing your HashMap<String, ArrayList<String>> with a HashMap<Integer, ArrayList<Integer>>? You can use Integer.parseInt() to do the string-to-int conversion, and an Integer should always be smaller than the corresponding String.

Other suggestions: replace your ArrayList with a HashSet if you don't care about duplicates.

Per outofBounds' answer, you don't need to clone an ArrayList every time you want to add an item to it.
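A minimal sketch of that combined change, assuming the IDs always fit in an int and reusing the file-reading setup from the question (the helper name buildParentIndex is hypothetical):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

// Keeps only the child and parent columns, as ints: each entry stores one
// boxed Integer instead of the whole 40-odd-character line as a String.
static HashMap<Integer, ArrayList<Integer>> buildParentIndex(String fileName) throws IOException {
    HashMap<Integer, ArrayList<Integer>> map = new HashMap<Integer, ArrayList<Integer>>();
    BufferedReader bReader = new BufferedReader(new FileReader(fileName));
    try {
        String line;
        while ((line = bReader.readLine()) != null) {
            String[] f = line.split(";");
            Integer child = Integer.parseInt(f[0]);
            Integer parent = Integer.parseInt(f[3]);
            ArrayList<Integer> children = map.get(parent);
            if (children == null) {
                children = new ArrayList<Integer>();
                map.put(parent, children);
            }
            children.add(child);
        }
    } finally {
        bReader.close();
    }
    return map;
}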

Sbodd
  • Hi, the lines given are just an example (sorry for not being accurate with the values). I need the information in the ArrayList<String> values, and those contain Strings. I take care of duplicates when I use the recursive function getChildren. – user1256982 Mar 08 '12 at 15:28

Do check out increasing your memory, as suggested by others. Also, you can store your data in the table more compactly, as suggested by Sbodd and others.

However, you may be running afoul of memory fragmentation. Hash maps use arrays. Big hash maps use big arrays. You are not specifying the size of your HashMap, so every time it decides it needs to be bigger, it discards its old array and allocates a new one. After a while, your memory fills up with discarded hash table arrays and you get an OutOfMemoryError even though you technically have plenty of free memory. (90% of your memory could be available, but in pieces too small to use.)

The garbage collector (GC) works continuously to combine all these free bits into blocks big enough to use. If your program ran slowly enough, you would not have a problem, but it is running full tilt and the GC is going to fall behind. The GC throws the error if it cannot assemble a free block big enough, fast enough; the mere fact that the memory exists will not stop it. (This means that a program that could run won't, but it keeps the JVM from running very slowly and looking very bad to users.)

Given that you know how big your hash map has to be, I'd set the size up front. Even if the size isn't precisely right, it may solve your memory problem without increasing the heap size, and it will definitely make your program run faster (or as fast as your file reads let it; use big file buffers).

If you have no real idea how big your table might be, use a TreeMap. It's a bit slower, but it does not allocate huge arrays and is hence a lot kinder to the GC. I find them a lot more flexible and useful. You might even look at ConcurrentSkipListMap, which is slower than TreeMap but lets you add, read, and delete from multiple threads simultaneously.
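For instance, a sketch of those alternatives (reusing the map shape from the question; variable names are placeholders):

import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

// No huge backing array to resize: a tree allocates one small node per entry.
Map<String, ArrayList<String>> hMap = new TreeMap<String, ArrayList<String>>();

// Or, if multiple threads must add, read, and delete concurrently:
Map<String, ArrayList<String>> shared = new ConcurrentSkipListMap<String, ArrayList<String>>();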

But your best bet is something like:

hMap = new HashMap<String, ArrayList<String>>(10000000);
RalphChapin
  • It would be better to do a fast scan of the file to get an upper-bound estimate on the number of elements N, and use that to initialize the map with new HashMap(N). A TreeMap might be a better choice. – hidralisk Mar 08 '12 at 15:21

Inside your while loop you can save some space with something like this:

String[] tmpObj = myReader.split(delimiter);

// String + String creates a new String at each step; String.format(...) builds it once
//String valuesArrayS = tmpObj[0] + ";" + tmpObj[1] + ";" + tmpObj[2] + ";" + tmpObj[3] + ";" + tmpObj[4];

// only add a new list if there is no list for this key yet
if (!hMap.containsKey(tmpObj[3])) {
    hMap.put(tmpObj[3], new ArrayList<String>());
}
// get the list from the map and add the new entry
List<String> values = hMap.get(tmpObj[3]);
values.add(String.format("%s;%s;%s;%s;%s", tmpObj[0], tmpObj[1], tmpObj[2], tmpObj[3], tmpObj[4]));

There is no need to clone the list.

outofBounds

You are really testing the boundaries of what one can do with 1 GB of memory.

You could:

  1. Increase the heap space. 32-bit Windows will limit you to ~1.5 GB, but you still have a little wiggle room, and it might be enough to put you over the top.
  2. Build some kind of pre-processing utility that pre-partitions the file into sizes you know will work and operates on the pieces one at a time, perhaps hierarchically.
  3. Try restructuring your program. It has a lot of splitting and concatenating going on. In Java, Strings are immutable, and when you split strings and concatenate them with the + operator you are creating new Strings all the time (in 9 out of 10 cases this doesn't matter, but in your case, where you are working with a very limited set of resources, it might make a difference).

As a less helpful side note: the real issue here is that you just don't have the resources to tackle this task, and optimization will only take you so far. It's like asking how to better tunnel through a mountain with a garden trowel. The real answer is probably the one you don't want to hear: throw away the trowel and invest in some industrial equipment.

On a second, more helpful note (and a fun one, if you're like me): you may try hooking VisualVM up to your application to understand where your heap is going, or use jhat and the -XX:+HeapDumpOnOutOfMemoryError JVM flag to see what was happening with the heap at crash time.
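For example (the class name MyApp and the dump file name are placeholders):

java -Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=heap.hprof MyApp
jhat heap.hprof    # then browse the heap at http://localhost:7000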

nsfyn55
  • Hi, I was about to start pre-processing the file, but since I'm using a recursive function and the structure to be extracted may contain anomalies (a parent could be a grandchild of itself), I hope to find another solution. This option is still open for me :). – user1256982 Mar 08 '12 at 15:00
  • Point 3: did you mean, instead of `String[] tmpObj = myReader.split(delimiter); String valuesArrayS = tmpObj[0]+";"+tmpObj[1]+";"+tmpObj[2]+";"+tmpObj[3]+";"+tmpObj[4];`, something like **this**: `valuesArrayS = tmpObj[0]+";"+myReader;`? – user1256982 Mar 08 '12 at 15:05