Application server CPU goes above 80% and the server hangs; the same problem repeats about every 24 hours

Question

I have an IBM WebSphere Application Server 8.5 installation working with Db2 11.1. It has worked for 2 years. For the past month the application server has been hanging: the database CPU goes to 0, the application server CPU goes above 80%, and the server hangs. After nearly 24 hours the same problem repeats, every day. The logs follow.

db2diag Error today 2020-12-09-10.03.24.732486+120 I1234525159E610 LEVEL: Error PID : 5737 TID : 139739072030464 PROC : db2sysc 0 INSTANCE: db2inst1 NODE : 000 DB : WPJCR APPHDL : 0-38161 APPID: ::ffff:x.42258.201209075007 UOWID : 199 ACTID: 1 AUTHID : DB2INST1 HOSTNAME: ERTUWCMDB1Az EDUID : 1760 EDUNAME: db2agent (WPJCR) 0 FUNCTION: DB2 UDB, common communication, sqlcctest, probe:50 MESSAGE : sqlcctest RC DATA #1 : Hexdump, 2 bytes 0x00007F1789BFCDE0 : 3600 6.

2020-12-09-10.03.24.732661+120 I1234525770E601 LEVEL: Error PID : 5737 TID : 139739072030464 PROC : db2sysc 0 INSTANCE: db2inst1 NODE : 000 DB : WPJCR APPHDL : 0-38161 APPID: ::ffff:x.42258.201209075007 UOWID : 199 ACTID: 1 AUTHID : DB2INST1 HOSTNAME: ERTUWCMDB1Az EDUID : 1760 EDUNAME: db2agent (WPJCR) 0 FUNCTION: DB2 UDB, base sys utilities, sqeAgent::AgentBreathingPoint, probe:10 CALLED : DB2 UDB, common communication, sqlcctest RETCODE : ZRC=0x00000036=54
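The two db2diag entries above appear to report the same return code. A minimal sketch, assuming the 2-byte hexdump is a little-endian integer (an assumption about this host, typical for Db2 on x86-64 Linux), shows that the bytes `36 00` decode to 54, which matches `ZRC=0x00000036=54` in the second entry:

```python
import struct

# Bytes copied from the hexdump in the first db2diag entry ("3600").
raw = bytes.fromhex("3600")

# Assumption: the host is little-endian, so the low-order byte comes first.
(rc,) = struct.unpack("<H", raw)

# ZRC value copied from the second db2diag entry.
zrc = 0x00000036

print(rc, zrc, rc == zrc)
```

This only shows that both entries carry the same code; it does not by itself say what the code means.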

[11/3/20 6:42:13:596 EET] 000006ad XATransaction E J2CA0027E: An exception occurred while invoking rollback on an XA Resource Adapter from DataSource jdbc/wpjcrdbDS, within transaction ID {XidImpl: formatId(57415344), gtrid_length(36), bqual_length(54),

data(000001758c648aa7000000082a775800f8c220c5f6bdab92156eae0be31e28ea7605ade8000001758c648aa7000000082a775800f8c220c5f6bdab92156eae0be31e28ea7605ade8000000010000000000000000000000000001)} : com.ibm.db2.jcc.am.XaException: [jcc][t4][2041][12326][4.25.13] Error executing XAResource.rollback(). Server returned XAER_NOTA. ERRORCODE=-4203, SQLSTATE=null

After a while the database CPU goes to 0, the application server CPU goes above 80%, and the server hangs. After nearly 24 hours the same problem repeats.

Is this a deadlock or a lock timeout due to data corruption?

Please carefully examine the Db2 diagnostic files (for example db2diag.log) and the Db notification log on the Db2-server. If there are deadlocks or timeouts they will be mentioned there. Your question is not about programming, but instead it is about __troubleshooting__ and for this, you need to have competent people who know how to read and understand the log files. Also helpful is to determine what has changed a month ago. — mao, Dec 10 '20 at 12:33
LEVEL: Error PID : 5737 TID : 139739072030464 PROC : db2sysc 0 WPJCR APPHDL : 0-38161 APPID: UOWID : 199 ACTID: 1 AUTHID : DB2INST1 HOSTNAME: ERTUWCMDB1Az EDUID : 1760 EDUNAME: db2agent (WPJCR) 0 FUNCTION: DB2 UDB, common communication, sqlcctest, probe:50 MESSAGE : sqlcctest RC DATA #1 : Hexdump, 2 bytes 0x00007F1789BFCDE0 : 3600 6. — noha Abdallah, Dec 10 '20 at 14:03
Welcome to SO. Please take your time to properly format your question before posting. — Daemon Painter, Dec 10 '20 at 14:21
This is probably more related to DB2 than WAS, but here is a link to WAS performance debugging: https://www.ibm.com/support/pages/node/72419 It helps you collect thread dumps for the WAS process, which tell you which WAS threads are using the most CPU and whether there is a deadlock. On Windows, a Jython script is used to collect thread dumps. Put the following contents in a file named ThirtyThreadDumps.py (substitute the correct server name for "server1"):
    from time import sleep
    jvm = AdminControl.completeObjectName('type=JVM,process=server1,*')
    for x in range(30):
        AdminControl.invoke(jvm, 'dumpThreads')
        sleep(30)
— jblye, Dec 10 '20 at 14:22

F Rowe · Answer 1 · 2020-12-10T15:08:41.533

Without seeing any other app server logs, the combination of you noting that

- "nearly 24 hour the problem repeats"
- the sqeAgent::AgentBreathingPoint error (see IBM technote https://www.ibm.com/support/pages/what-does-agentbreathingpoint-error-mean-db2 for more info)
- the "works from 2 years. Since a month the Application server hangs"

would lead me to look for a change in your network where a connection timeout has been set recently, closing connections after 24 hours. This can be caused by replacing a router or upgrading firmware where the settings are different. Does this occur at about the same time every day and, if so, does it occur as the app goes from a quiet state (like overnight) to a busy state (like the start of a workday)?

Based on your answer, it sounds like the entire connection pool is becoming "stale" overnight, meaning the connections are not being used and a network timeout is causing them to become disconnected from the db server. You can try changing the WAS datasource settings for "Minimum connections" to 0 and the "Unused Timeout" to perhaps 12 hours. This will allow the connection pool to drain overnight as the server traffic quiesces. As the app load starts in the morning, new connections will be obtained, avoiding the errors. If your "Maximum Connections" setting is very large, you may experience some slowness as the connection pool is being filled.
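The datasource change suggested above can also be scripted. The following is a minimal wsadmin Jython sketch, not a tested procedure: `relax_pool` is a hypothetical helper name, the JNDI name is taken from the J2CA0027E message in the question, and it assumes the connection pool attributes are named `minConnections` and `unusedTimeout` with the timeout expressed in seconds. `AdminConfig` is the scripting object wsadmin provides; it is passed in as a parameter so the function does not depend on a global.

```python
def relax_pool(admin_config, jndi_name, unused_timeout_seconds):
    """Set Minimum connections to 0 and Unused timeout on one data source.

    Returns the connection pool config ID, or None if no data source
    with the given JNDI name is found.
    """
    for ds in admin_config.list('DataSource').splitlines():
        if admin_config.showAttribute(ds, 'jndiName') == jndi_name:
            pool = admin_config.showAttribute(ds, 'connectionPool')
            admin_config.modify(pool, [
                ['minConnections', '0'],
                ['unusedTimeout', str(unused_timeout_seconds)],
            ])
            return pool
    return None

# Inside wsadmin (started with -lang jython), usage might look like:
#   pool = relax_pool(AdminConfig, 'jdbc/wpjcrdbDS', 12 * 60 * 60)
#   if pool:
#       AdminConfig.save()
```

The changes only reach the configuration repository once `AdminConfig.save()` is called, and the server will likely need a restart (and the node a synchronization, in a clustered cell) before the new pool settings take effect.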

This occurs at about the same time every day, but it shifts by +1 to +2 hours daily, so it is not exactly the same time. Yes, it starts when going from a quiet to a busy state. — noha Abdallah, Dec 10 '20 at 14:50
