3

I have implemented a Java program. It is basically a multi-threaded service with a fixed number of threads. Each thread takes one task at a time and creates a HashSet; the size of a single HashSet can vary from 10 to 20,000+ items. At the end of each thread, the result is added to a shared List under synchronization.

The problem is that at some point I start getting an out-of-memory exception. After doing a bit of research, I found that this exception occurs when the GC is busy clearing memory, and at that point it stops the world so nothing else can execute.

Please give me suggestions on how to deal with such a large amount of data. Is HashSet the correct data structure to use? How do I deal with the memory exception? One way is to call System.gc(), which again is not good as it will slow down the whole process. Or is it possible to dispose of the "HashSet hsN" after I add it to the shared List?

Please let me know your thoughts and guide me on where I am going wrong. This service is going to deal with a huge amount of data processing.

Thanks

//business object - to save the result of thread execution

public class Location {

    int taskIndex;           // index of the task that produced this result
    HashSet<Integer> hsN;    // set of integers computed by that task

    Location(int taskIndex, HashSet<Integer> hsN) {
        this.taskIndex = taskIndex;
        this.hsN = hsN;
    }
}



//task to be performed by each thread


public class MyTask implements Runnable {

    private final long task;

    MyTask(long task) {
        this.task = task;
    }

    @Override
    public void run() {
        // some function that returns a collection of integers whose size varies from 10 to 20,000+
        HashSet<Integer> hsN = GiveMeResult(task);

        // locations is the shared List<Location> declared in Main
        synchronized (locations) {
            locations.add(new Location((int) task, hsN));
        }
    }
}


public class Main {

    private static final int NTHREDS = 8;
    private static List<Location> locations;

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
        for (int i = 0; i < 216000; i++) {
            Runnable worker = new MyTask(i);
            executor.execute(worker);
        }
        // This will make the executor accept no new tasks
        // and finish all tasks already in the queue
        executor.shutdown();
        // Wait until all tasks are finished
        while (!executor.isTerminated()) {

        }
        System.out.println("Finished all threads");
    }
}

For such an implementation, is Java the best choice, or C# on .NET 4?

user1213831

5 Answers

5

A couple of issues that I can see:

  • You synchronize on the MyTask object, which is created separately for each execution. You should be synchronizing on a shared object, preferably the one that you are modifying i.e. the locations object.

  • 216,000 runs, multiplied by say 10,000 returned objects each, multiplied by a minimum of 12 bytes per Integer object is about 24 GB of memory. Do you even have that much physical memory available on your computer, let alone available to the JVM?

    32-bit JVMs have a heap size limit of less than 2 GB. On a 64-bit JVM on the other hand, an Integer object takes about 16 bytes, which raises the memory requirements to over 30 GB.

    With these numbers it's hardly surprising that you get an OutOfMemoryError...

    PS: If you do have that much physical memory available and you still think that you are doing the right thing, you might want to have a look at tuning the JVM heap size.
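For example, assuming the entry point is the Main class shown in the question, the maximum and initial heap size can be raised on the command line (-Xms and -Xmx are standard JVM options; whether such a heap is actually usable depends on the physical memory available):

    java -Xms25g -Xmx25g Main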

EDIT:

Even with 25 GB of memory available to the JVM, it could still be pushing it:

  • Each Integer object requires 16 bytes on modern 64-bit JVMs.

  • You also need an 8-byte reference that will point to it, regardless of which List implementation you are using.

  • If you are using a linked list implementation, each entry will also have an overhead of at least 24 bytes for the list entry object.

At best you could hope to store about 1,000,000,000 Integer objects in 25 GB, or half that if you are using a linked list. That means that each task could not produce more than about 5,000 objects on average (or 2,500 with a linked list) without causing an error.

I am unsure of your exact requirement, but have you considered returning a more compact object? For example, an int[] array produced from each HashSet would need only 4 bytes per result, without the per-object container overhead.
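A minimal sketch of that conversion, assuming each set is no longer modified once its task has finished (the class and method names here are illustrative, not from the original code):

import java.util.HashSet;

public class CompactResult {

    // Copies the boxed Integers into a plain int[]: 4 bytes per value, no per-element objects
    static int[] toIntArray(HashSet<Integer> hsN) {
        int[] result = new int[hsN.size()];
        int i = 0;
        for (int value : hsN) {   // auto-unboxing; the HashSet can be discarded afterwards
            result[i++] = value;
        }
        return result;
    }
}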

EDIT 2:

I just realized that you are storing the HashSet objects themselves in the list. A HashSet uses a HashMap internally, which in turn allocates a HashMap.Entry object for each entry. On a 64-bit JVM the entry object consumes about 40 bytes of memory in addition to the stored object:

  • The key reference which points to the Integer object - 8 bytes.

  • The value reference (always null in a HashSet) - 8 bytes.

  • The next entry reference - 8 bytes.

  • The hash value - 4 bytes.

  • The object overhead - 8 bytes.

  • Object padding - 4 bytes.

I.e. for each Integer object you need 56 bytes for storage in a HashSet. With the typical HashMap load factor of 0.75, you should add another 10 or so bytes for the HashMap array references. With 66 bytes per Integer you can only store about 400,000,000 such objects in 25 GB, without taking into account the rest of your application or any other overhead. That's less than 2,000 objects per task...
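Spelling out that arithmetic (the per-object sizes are the estimates listed above for a typical 64-bit JVM, not measured values):

public class HeapEstimate {
    public static void main(String[] args) {
        long entryOverhead = 8 + 8 + 8 + 4 + 8 + 4;     // key ref + value ref + next ref + hash + header + padding = 40
        long perInteger    = 16 + entryOverhead + 10;   // Integer object + HashMap.Entry + table slot ≈ 66 bytes
        long heap          = 25L * 1024 * 1024 * 1024;  // 25 GB

        long maxIntegers = heap / perInteger;           // ≈ 400 million Integers
        System.out.println("Integers that fit: " + maxIntegers);
        System.out.println("Average per task:  " + maxIntegers / 216000);  // < 2,000
    }
}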

EDIT 3:

You would be better off storing a sorted int[] array instead of a HashSet. That array is searchable in logarithmic time for any arbitrary integer and minimizes the memory consumption to 4 bytes per number. Considering the memory I/O, it would also be as fast as (or faster than) the HashSet implementation.
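A sketch of that approach, assuming the int[] for each task is built once (for instance with the conversion shown earlier) and then only queried; the names are illustrative:

import java.util.Arrays;

public class SortedLookup {

    // Sort once after the task has produced its int[] result...
    static int[] sorted(int[] values) {
        int[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // ...then membership tests run in O(log n), with no per-element objects
    static boolean contains(int[] sortedValues, int value) {
        return Arrays.binarySearch(sortedValues, value) >= 0;
    }
}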

thkala
  • I have 30 GB of physical memory. I tried to tune the heap from 8 GB to 16 GB and 25 GB using -Xmx, but with no success. I am synchronizing on the shared object; I corrected that, it was a typing error on my side. My understanding was that the GC marks the local HashSet as a candidate for removal at the end of each thread's execution, but doesn't actually remove it from memory. At some periodic interval the GC tries to clear the memory, and as there will be many such HashSets in memory to be removed... the GC just stops/hangs my processing. Is there a way to dispose of the HashSet just before the end of thread execution? – user1213831 Feb 25 '12 at 18:18
  • @user1213831: What kind of collection is `locations` and do you have an estimate on its size at the time of the error? – thkala Feb 25 '12 at 18:54
  • Thanks for the details. I will try to use int[] instead of HashSet, analyze the result, and share it. Thanks very much – user1213831 Feb 25 '12 at 22:15
  • Sorry, I forgot to mention: I cannot use int[] because the size is not known beforehand. What if I use an ArrayList? – user1213831 Feb 28 '12 at 11:43
  • @user1213831: if you are not modifying your HashSet *after* it is inserted in the list, its size *is* known and you can create an `int[]` and sort it for list insertion... – thkala Feb 28 '12 at 13:36
  • I used ArrayList instead of HashSet. So far it looks promising. The initial result was good; memory usage dropped from about 15 GB to 8 GB. I am still waiting for the full process to complete. My other question is: do you think I can achieve more control over speed + memory if the same service is written in C# instead of Java? I mean, is C# better than Java when it comes to multi-threading? – user1213831 Feb 28 '12 at 19:56
1

If you want a more memory-efficient solution, I would use TIntHashSet or a sorted int[]. As for the garbage collector: you get a full GC before an OutOfMemoryError, but these are symptoms, not the cause of the problem. The cause of the problem is that you are using more memory than the maximum heap you are allowing.

Another solution is to create tasks as you go instead of creating all your tasks in advance. You can do this by breaking your work into NTHREDS tasks instead. However, it appears that you are trying to retain every solution; if so, this won't help much, and you need to find a way to reduce consumption instead.
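A rough sketch of the create-tasks-as-you-go idea, assuming each worker can iterate over its own slice of the 216,000 task indices (class and variable names are illustrative, not from the original code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChunkedMain {

    private static final int NTHREDS = 8;
    private static final int TOTAL_TASKS = 216000;

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
        int chunk = (TOTAL_TASKS + NTHREDS - 1) / NTHREDS;
        for (int t = 0; t < NTHREDS; t++) {
            final int from = t * chunk;
            final int to = Math.min(from + chunk, TOTAL_TASKS);
            // one Runnable per thread, each processing its own range,
            // instead of 216,000 queued Runnables created up front
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    for (int task = from; task < to; task++) {
                        // process task here and consume or reduce the result instead of retaining it all
                    }
                }
            });
        }
        executor.shutdown();
    }
}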

Depending on your distribution of numbers, a BitSet may be more efficient. This uses 1 bit per integer in a range; e.g. if your range is 0 - 20,000, it will use only 2.5 KB.
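For illustration, assuming a task's values are known to fall in the range 0 - 20,000, a BitSet-based result might look like this sketch:

import java.util.BitSet;

public class BitSetResult {
    public static void main(String[] args) {
        BitSet result = new BitSet(20001);        // one bit per possible value 0..20,000 ≈ 2.5 KB

        result.set(12345);                        // equivalent of HashSet.add(12345)
        result.set(42);

        System.out.println(result.get(12345));    // equivalent of HashSet.contains(12345) -> true
        System.out.println(result.get(7));        // -> false
        System.out.println(result.cardinality()); // number of values stored -> 2
    }
}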

Peter Lawrey
  • So it means that instead of a HashSet, if I use a HashSet where each element is an integer value... it will consume only 1 bit? – user1213831 Feb 25 '12 at 18:37
  • No. Instead of a `HashSet`, use a `BitSet`. This uses a bit per possible value, which is not an improvement if you have a lot of possible values and only a few actual values. – Peter Lawrey Feb 26 '12 at 07:44
1

After doing a bit of research, I found that this exception occurs when the GC is busy clearing memory, and at that point it stops the world so nothing else can execute.

No, not true. Memory exceptions occur because you are using more memory than was allocated to your program. Very rarely is a memory exception due to some behavior of the GC, and that can happen only if you configure the GC poorly.

Have you tried running with a larger -Xmx value? And why don't you just use a Hashtable for locations?
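A sketch of the Hashtable suggestion, assuming the task index is the natural key; since Hashtable's methods are synchronized internally, the explicit synchronized block would no longer be needed (names are illustrative):

import java.util.HashSet;
import java.util.Hashtable;
import java.util.Map;

public class SharedResults {

    // Shared, thread-safe map from task index to that task's result
    static final Map<Long, HashSet<Integer>> locations = new Hashtable<Long, HashSet<Integer>>();

    static void store(long task, HashSet<Integer> hsN) {
        locations.put(task, hsN);   // Hashtable.put is synchronized internally
    }
}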

Amir Afghani
  • Yes, I tried -Xmx to allocate more memory, from 8 to 16 GB, but it is still not enough as the HashSet can grow larger. The reason I am using HashSet is that at a later stage, once all threads are done, I have to iterate through the shared objects, and in each iteration I need to check whether the HashSet contains an element with a certain value. For this, HashSet.contains is faster. Correct me if I am wrong. – user1213831 Feb 25 '12 at 18:15
1
  • If you are going to keep 216,000 × 10,000 Integers in memory, you really do require a huge amount of memory.
  • You can try setting -Xmx to the maximum allowable on your system and see how many objects you can store before you run out of memory.
  • It is not clear why you want to store the results of processing from so many threads; what is the next step? If you really need to store so much data, you should probably use a database.
Ashwinee K Jha
0

You probably need to increase the size of your heap. Please look at the -Xmx JVM setting.

Peter Cetinski