
Load 1.5 Million Records from Database 1

Load 1.5 Million Records from Database 2

List<DannDB> dDb = fromNamedQuery(); //return em.createNamedQuery("").getResultList();
List<LannDB> lDb = fromNamedQuery();

Compare their data.

Update/persist into the database (using JPA).

The program ends after two hours.

The same iteration happens every three hours, and it often ends with an OutOfMemoryError.

Do the following statements work? Does the object go out of scope with this?

 dDb.clear();

  or 

 dDb = null;

Or what else can I do?

fatherazrael
  • It should get garbage collected when it goes out of scope. What's your program doing for two hours? – shmosel Jun 09 '21 at 09:21
  • My program waits about 1.5 hours just to get the data from the source view in Database 1. The other processing then takes around 10 to at most 30 minutes. Now, how will the variable go out of scope? Does clear() or null make it out of scope? – fatherazrael Jun 09 '21 at 09:37
  • In the vast majority of cases setting a variable to null or calling `clear()` is not necessary, because the variable should go out of scope if it's no longer needed anyway. In fact calling `clear()` might cause *more* work than simply letting it be garbage collected. **If** however you somehow hold on to that variable even though you no longer need it, then you should change that. And if you **can't** change that for some reason, setting it to `null` might work. Show us more precisely where this variable is. Is it a local variable? A field? – Joachim Sauer Jun 09 '21 at 09:42
  • It is a scheduler. The program ends (it uses JTA/JPA on WildFly); nothing happens at the end. When you say out of scope, I picture Global { Local }: at the last line the local scope finishes, and the global scope finishes when the scheduler stops. When the EJB scheduler starts, it basically creates a new object. (So as the heap keeps filling, the collector must be noticing and emptying the previous objects based on scope?) Also, is it possible that the garbage collector is late in collecting objects and causes the OOM? – fatherazrael Jun 09 '21 at 09:50
  • Why are you loading 1.5 million records from the database in the first place? This is the real problem. – user207421 Jun 09 '21 at 09:58
  • @user207421: It is an unavoidable requirement. – fatherazrael Jun 09 '21 at 10:12
  • There is quite a lot we don't know here. How large is a record? What is the memory footprint of a record? Is there a performance requirement for the operation? Could you store the records off-heap? – Erik Jun 09 '21 at 11:10
  • It has only 6 columns (2 email addresses, an ID Integer(20), a GUID, and a role-name String such as Manager). What do you mean by the memory footprint of a record? Heap? Stack? – fatherazrael Jun 09 '21 at 11:18
  • How many bytes is a record? If you wrote all the records to a file, how large would the file be? – Erik Jun 09 '21 at 11:23
  • You expect to load 1.5 million entries and everything to be fine? The amount of memory needed is not going to be small, at all. – Eugene Jun 09 '21 at 15:09
  • If I load chunks of data, say 100,000 from DB1, then load 100,000 from DB2 and compare (then the next 100,000 from DB2 and compare, and so on up to 1.5 million), and then move on to the next 100,000 from DB1: do you think that will be memory efficient? Will garbage collection happen more easily, or will it lead to the same problem? (I'm just thinking it will be quite a long task, and the DB folks would have to get involved in some performance work before I can do this.) – fatherazrael Jun 09 '21 at 18:12

4 Answers


Assuming that your goal is to reduce the occurrence of OOMEs over all other considerations ...

Assigning null to the list variable will make the entire list eligible for garbage collection. You then need to create a new (presumably empty) list to replace it.

Calling clear() will have a similar effect [1] to nulling and recreating, though the details will depend on the List implementation. (For example, calling clear() on an ArrayList doesn't release the backing array. It just nulls the array cells.)

If you can recycle an ArrayList for a list of roughly the same size as the original, you can avoid the garbage while growing the list. (But we don't know this is an ArrayList!)
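For illustration, here is a minimal sketch (assuming an ArrayList) of the difference between the two options:

import java.util.ArrayList;
import java.util.List;

public class ClearVsNull {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>(1_500_000);
        list.add("record");

        // Option 1: clear() nulls the cells but keeps the (large)
        // backing array, so the list can be refilled without re-growing.
        list.clear();

        // Option 2: dropping the last reference makes the whole list,
        // backing array included, eligible for garbage collection.
        list = null;

        // A replacement list starts with a fresh backing array.
        list = new ArrayList<>();
    }
}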

Another factor in your use-case is that:

List<DannDB> dDb = fromNamedQuery();

is (presumably) going to create a new list anyway. That would render a clear() pointless. (Just assign null to dDb, or let the variable go out of scope or be reassigned the new list.)
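In other words, a sketch (where fromNamedQuery() stands in for the OP's em.createNamedQuery(...).getResultList() call):

import java.util.Collections;
import java.util.List;

public class Reload {
    private List<String> dDb;

    // Placeholder for the OP's named-query call.
    private List<String> fromNamedQuery() {
        return Collections.emptyList();
    }

    void reload() {
        // Nulling first lets the old list be collected while the
        // (long-running) query executes; a prior clear() would be
        // wasted work, since the variable is reassigned anyway.
        dDb = null;
        dDb = fromNamedQuery();
    }
}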

A final issue is that it is conceivable that the list is finalizable. That could mean that the list object takes longer to delete.

Overall, I can't say which of assigning null and calling clear() will be better for the memory footprint. Or that either of these will make a significant difference. But there is no reason why you can't try both alternatives, and observe what happens.

The only other things I can suggest are:

  • Increase the heap size (and the RAM footprint).
  • Change the application so that you don't need to hold entire database snapshots in memory. Depending on the nature of the comparison, you could do it in "chunks" or by streaming the records [2]; a chunked sketch follows below.

The last one is the only solution that is scalable; i.e. that will work with an ever larger number of records. (Modulo the time taken to deal with more records.)
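For example, a chunked comparison might look roughly like this. This is only a sketch: the named-query names, the CHUNK size, and the compareAndPersist helper are hypothetical, DannDB/LannDB are the OP's entities, and both queries must use the same stable ORDER BY for the offsets to line up:

import java.util.List;
import javax.persistence.EntityManager;

public class ChunkedCompare {

    private static final int CHUNK = 100_000; // tune to the available heap

    // One EntityManager per database; the named queries are hypothetical
    // and must ORDER BY the same stable key so the chunks line up.
    void run(EntityManager emD, EntityManager emL) {
        for (int offset = 0; ; offset += CHUNK) {
            List<DannDB> dChunk = emD
                    .createNamedQuery("DannDB.findAllOrdered", DannDB.class)
                    .setFirstResult(offset)
                    .setMaxResults(CHUNK)
                    .getResultList();
            List<LannDB> lChunk = emL
                    .createNamedQuery("LannDB.findAllOrdered", LannDB.class)
                    .setFirstResult(offset)
                    .setMaxResults(CHUNK)
                    .getResultList();
            if (dChunk.isEmpty() && lChunk.isEmpty()) {
                break; // both sources exhausted
            }

            compareAndPersist(dChunk, lChunk); // hypothetical helper

            // Push pending updates out, then detach everything so the
            // chunk's entities become eligible for garbage collection.
            emD.flush();
            emD.clear();
            emL.clear();
        }
    }

    private void compareAndPersist(List<DannDB> d, List<LannDB> l) {
        // compare the two chunks and persist any differences
    }
}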


Important Notes:

  1. Manually running System.gc() is unlikely to help. At best it will (just) make your application slower.
  2. Since the real problem is that you are getting OOMEs, anything that tries to get the JVM to shrink the heap by giving memory back to the OS will be counterproductive.

[1] Similar from the perspective of storage management. Obviously, there are semantic differences between clearing a list and creating a new one; e.g. if some other part of your application has a reference to the original list.
[2] Those of you who are old enough will remember the classic way of implementing a payroll system with magnetic-tape storage. If you can select from the two data sources in the same key order, you may be able to use the classic approach to compare them, for example by reading two result sets in parallel.
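A rough sketch of that merge-style comparison, assuming both result sets are ordered by the same key (the id column name is made up):

import java.sql.ResultSet;
import java.sql.SQLException;

public class MergeCompare {

    // Compares two result sets that are both ordered by the same key
    // (e.g. "... ORDER BY id"), one row at a time, so neither side
    // ever has to be held in memory in full.
    static void compare(ResultSet a, ResultSet b) throws SQLException {
        boolean hasA = a.next();
        boolean hasB = b.next();
        while (hasA && hasB) {
            long keyA = a.getLong("id");
            long keyB = b.getLong("id");
            if (keyA < keyB) {
                // Row exists only in source A.
                hasA = a.next();
            } else if (keyA > keyB) {
                // Row exists only in source B.
                hasB = b.next();
            } else {
                // Same key: compare the remaining columns here.
                hasA = a.next();
                hasB = b.next();
            }
        }
        // Any rows left on either side exist only in that source.
    }
}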

Stephen C
  • Thanks for the comment, Stephen. Suppose I load chunks of data, say 100,000 from DB1, then load 100,000 from DB2 and compare (then the next 100,000 from DB2 and compare, and so on up to 1.5 million), and then move on to the next 100,000 from DB1. Do you think that will be memory efficient? Will garbage collection happen more easily, or will it lead to the same problem? (I'm just thinking it will be quite a long task, and the DB folks would have to get involved in some performance work before I can do this.) – fatherazrael Jun 09 '21 at 18:12
  • If the size of the in-memory lists is smaller, then there will be less heap pressure, and GC overheads will be less. And less risk of OOMEs. (This doesn't necessarily mean that the total memory (RAM) footprint will be smaller, but that's not what you should optimize for ...) – Stephen C Jun 10 '21 at 00:48
  • @Stephen C: As per the comments above, I understand that at this moment there is no other way to optimize than chunked retrieval of records, comparing each chunk with the corresponding chunk from DB2. The only other option is to increase memory. – fatherazrael Jun 11 '21 at 04:37
  • That would be my advice. – Stephen C Jun 11 '21 at 06:13

In the case of SQL, you can get your two ResultSets and compare their data iteratively. This way, you don't have to save all your data in the first place.
I assume that your data looks like this for demonstration purposes:

email1 (String)   email2 (String)   someInt (int)
abc@def.ghi       jkl@mno.pqr       1234567
xyz@gmail.com                       8901234


To detect a difference between two ResultSets of this shape:

boolean equals(ResultSet a, ResultSet b) throws SQLException {
    while (a.next() && b.next()) {
        String aEmail1 = a.getString(1);
        String bEmail1 = b.getString(1);
        // java.util.Objects.equals also handles NULL columns
        // (see the second sample row above, which has no email2)
        if (!java.util.Objects.equals(aEmail1, bEmail1)) return false;
        String aEmail2 = a.getString(2);
        String bEmail2 = b.getString(2);
        if (!java.util.Objects.equals(aEmail2, bEmail2)) return false;
        int aSomeInt = a.getInt(3);
        int bSomeInt = b.getInt(3);
        if (aSomeInt != bSomeInt) return false;
        if (a.isLast() != b.isLast())
            throw new IllegalArgumentException(
                "ResultSets have different amounts of rows!"
            );
    }
    return true;
}

To copy the contents of ResultSet newData into ResultSet oldData (and, through it, into oldData's corresponding database):

void updateA(ResultSet oldData, ResultSet newData) throws SQLException {
    while (oldData.next() && newData.next()) {
        String newEmail1 = newData.getString(1);
        oldData.updateString(1, newEmail1);
        String newEmail2 = newData.getString(2);
        oldData.updateString(2, newEmail2);
        int newSomeInt = newData.getInt(3);
        oldData.updateInt(3, newSomeInt);
        oldData.updateRow(); // push the updated row back to the database
        if (oldData.isLast() != newData.isLast())
            throw new IllegalArgumentException(
                "ResultSets have different amounts of rows!"
            );
    }
}


You can of course leave out the if(a.isLast()!=b.isLast()) ... and if(oldData.isLast()!=newData.isLast()) ... checks if you don't care whether the two sets have the same number of rows.
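Note that for updateA to work, oldData has to come from an updatable (and, for isLast(), scrollable) statement. A sketch of the setup, with a placeholder query:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UpdatableQuery {

    static ResultSet openUpdatable(Connection conn) throws SQLException {
        // Scrollable so that isLast() is supported; updatable so that
        // updateString()/updateInt()/updateRow() write back to the table.
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_SCROLL_SENSITIVE,
                ResultSet.CONCUR_UPDATABLE);
        return stmt.executeQuery(
                "SELECT email1, email2, someInt FROM someTable");
    }
}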

Quadslab
  • Are you sure it will solve the memory issue? – fatherazrael Jun 11 '21 at 09:09
  • @fatherazrael It always depends on the JVM, but you don't need to box `int`, `float`, etc. in their classes with this approach. `ResultSet` is the direct result of a `Statement` and you probably can't really get less overhead in SQL than using this. I would try this, and if there still is a memory issue, you need to change the statement itself, i.e. not getting everything in one call, but get the first n rows or columns instead. – Quadslab Jun 11 '21 at 10:10

The thing is that, by default, the heap memory once allocated does not shrink (I mean the memory allocated from the operating system). If your Java application at one time needed 2 GB of RAM, it will keep that reserved from the operating system by default.

If you can, try to change the design of your application so that it does not load all the data into memory first, but only loads what it really needs to do its work.

If you really need the two big batches at the same time think about using the following Java command line argument: "-XX:+UseAdaptiveSizePolicy", which would make it possible to shrink the heap space after big memory usages.

You can also call the garbage collector via "System.gc();", but that (a) does not shrink the allocated heap memory without the suggested command line argument, and (b) really, you should not think about doing this. Java will run it on its own in time.

Edit: Improved my first explanation a bit.

michael.k
  • The standards don't require that heap never shrinks. In fact the JVM spec explicitly says "The heap may be of a fixed size or may be expanded as required by the computation and may be contracted if a larger heap becomes unnecessary." but doesn't say much more about how that's done. All the details of GC and memory management like that are implementation-defined. – Joachim Sauer Jun 09 '21 at 09:50
  • Now the heap size of the server group is 1024 MB and the max heap size is 4096 MB. So do you mean 1024 is reserved from the OS and 4096 will be considered virtual memory? – fatherazrael Jun 09 '21 at 09:52
  • @JoachimSauer Well the thing is that many developers expect the memory consumption of an application – which they may see in the Task Manager or something similar – to shrink once their program frees data. That is the case with a C/C++ program etc., but not so with the common JVM implementations (unless you do configure it to shrink). – michael.k Jun 09 '21 at 09:54
  • @michael.k: I think the phrase "by standard" is confusing here, you probably meant "by default", and I thought you meant that this behaviour is defined in the standards, which it is not. – Joachim Sauer Jun 09 '21 at 09:56
  • @fatherazrael Once your program has allocated the 4096 MB of heap memory, it will keep it allocated from the operating system in the usual JVM implementations, even if your need for heap space shrinks to 10 MB. – michael.k Jun 09 '21 at 09:57
  • @JoachimSauer You're right, I changed that "standard" to "default", thanks. – michael.k Jun 09 '21 at 09:59
  • @michael.k: I increased the heap size to 10 GB in production and found that the first execution went fine, but the second execution shut down the Linux server (0 memory) :( – fatherazrael Jun 09 '21 at 10:25
  • @fatherazrael this answer is completely wrong. the heap [does shrink](https://stackoverflow.com/questions/59362760/does-g1gc-release-back-memory-to-the-os-even-if-xms-xmx/59377080#59377080) even before jdk-12 where there is an explicit setting for this for `G1GC`. And the process is called un-commit memory. It depends on the JVM version, GC. – Eugene Jun 09 '21 at 15:03
  • Also, `UseAdaptiveSizePolicy` has exactly zero to do with shrinking of the heap. You confuse far too much in your answer. – Eugene Jun 09 '21 at 15:05
  • and ... shrinking of the heap has nothing to do with the OP's actual problem. It won't prevent OOMEs. – Stephen C Jun 11 '21 at 06:15

Best for memory usage would be for the list to not go out of scope at all. So it would be better (memory-wise) to just modify the contents one by one, keeping only one temporary entry object instead of a whole second list.

So you could create getNextFromNamedQuery() and hasNextInNamedQuery() methods and set the data at the current index.

e.g.:

int i = 0;
while (hasNextInNamedQuery()) {
    // Overwrite existing entries in place; only grow the list when needed.
    if (dDb.size() <= i) dDb.add(getNextFromNamedQuery());
    else dDb.set(i, getNextFromNamedQuery());
    i++;
}
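One way those two methods might be backed by a ResultSet, as suggested in the comments below (the query is a placeholder, and a real version would map whole rows to entity objects rather than a single column):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RowSource {

    private final ResultSet rs;
    private boolean buffered; // true if rs is positioned on an unread row

    RowSource(Connection conn) throws SQLException {
        rs = conn.createStatement()
                 .executeQuery("SELECT email1, email2, someInt FROM someTable");
    }

    boolean hasNextInNamedQuery() throws SQLException {
        if (!buffered) buffered = rs.next();
        return buffered;
    }

    String getNextFromNamedQuery() throws SQLException {
        if (!hasNextInNamedQuery())
            throw new IllegalStateException("no more rows");
        buffered = false;
        return rs.getString(1); // first column only, for brevity
    }
}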
Quadslab
  • Do not understand. A named query returns a list of entities, like em.createNamedQuery().getResultList(). Now how do I play with the result list? – fatherazrael Jun 09 '21 at 18:14
  • @fatherazrael what are you using to get your query? – Quadslab Jun 09 '21 at 18:45
  • select * from table name – fatherazrael Jun 11 '21 at 04:32
  • @fatherazrael That you are using SQL is rather important information! You can use a [ResultSet](https://docs.oracle.com/javase/8/docs/api/java/sql/ResultSet.html) and only get the data for the next row to do the comparisons. No need to create a list for all of the data. – Quadslab Jun 11 '21 at 05:11