We have a Java EE app running on WebLogic against an Oracle 11g DB, using the thin JDBC driver. Recently we had a series of incidents in production where updates and inserts into a certain table got stuck or took much longer than normal, for no apparent reason. This caused the application to use more and more DB connections (normally idle in the connection pool), the DB CPU and concurrency shot up (as seen in OEM), and the whole DB ground to a halt. During these incidents the DBAs could not find any reason for the inserts and updates to be stuck (no DB locks). What they did see were a lot of "SQL*Net message from client" wait events.

Their theory is that the app (the JDBC client) somehow got stuck during insert/update statements, for a reason unrelated to the DB, and did not acknowledge the DB's response to these statements. The fact that the app continued issuing more and more of these statements, tying up more and more connections, was what made the CPU and concurrency shoot up and the DB unresponsive.

I'm not convinced - if all the sessions were busy waiting for clients, how come the CPU was so high? We weren't able to consistently reproduce these incidents so we are really in the dark here...

Has anyone seen anything like this or have any ideas, suggestions of what this might be caused by?

Thanks

dkarlinsky
  • maybe look for something like trying to extend a tablespace - maybe having issues with disk space? – Randy Dec 26 '12 at 17:05
  • Is Oracle running on the same server as WebLogic? Or on separate servers? – Codo Dec 26 '12 at 17:17
  • A million reasons, quite literally... I would focus, as @randy has, on OS, disks / filesystem (you're not using ZFS/NAS are you?) etc but I'd only be able to make wild guesses and will probably be wrong. Without root access to your boxes while this is happening I don't see how anyone here will be able to help you meaningfully. When this sort of thing happens I normally grab a couple of sysadmins and a DBA or 2 and ensure that everyone sits down together... we don't have that luxury here. What sort of answer are you after? – Ben Dec 26 '12 at 17:22
  • @Codo: The database is running on a dedicated server. – dkarlinsky Dec 27 '12 at 14:35
  • @Ben: you are right of course, the problem is that the DBAs and the sysadmins looked at the systems while this was happening (and it happened several times) and we still don't have a meaningful answer... I guess I'm just hoping someone could suggest something we haven't thought of... – dkarlinsky Dec 27 '12 at 14:37
  • Hello, dear dkarlinsky, have you fixed this issue? Our application has the same symptoms. – Alex G Mar 16 '15 at 07:42

2 Answers

What you're describing is a "connection storm". A badly configured connection pool will "handle" slowly responding connections by opening new connections to service waiting requests. These additional connections place further strain on a server that is already stressed (if it weren't stressed, the initial connections wouldn't be lagging). This starts a cycle in which poor response spawns additional connections, which eventually kills the server.

You can avoid the connection storm by setting the Maximum Capacity of the data source to something reasonable. The definition of "reasonable" will vary according to the capabilities of your servers, but it is probably lower than you think. The best advice is to set the Maximum Capacity to the same value as Initial Capacity.
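In a WebLogic JDBC data source module, pinning the pool size amounts to a fragment like the following (element names per the WebLogic `jdbc-data-source` descriptor; the capacity value of 50 is purely illustrative, not a recommendation for your servers):

```xml
<!-- Sketch of a WebLogic JDBC data source descriptor fragment.
     Setting initial-capacity equal to max-capacity prevents the pool
     from growing under load; 50 is an illustrative value only. -->
<jdbc-connection-pool-params>
  <initial-capacity>50</initial-capacity>
  <min-capacity>50</min-capacity>
  <max-capacity>50</max-capacity>
</jdbc-connection-pool-params>
```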

Once you prevent the connection storm you can focus on the database process(es) which cause the initial slowdown.


The high number of "SQL*Net message from client" events indicates that the client is doing something without contacting the database. That is why your DBAs reason that the problem lies with the app.
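One way to check this during an incident is to see what the sessions are actually doing, for example with a query along these lines (a sketch against the standard `V$SESSION` view; it requires SELECT privilege on the view, and the exact columns may vary by version):

```sql
-- Sketch: group user sessions by wait event and state, to see whether
-- the "busy" sessions are really idle in "SQL*Net message from client"
-- or are burning CPU on something else.
SELECT event, state, COUNT(*) AS sessions
FROM   v$session
WHERE  type = 'USER'
GROUP  BY event, state
ORDER  BY sessions DESC;
```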

APC
  • Actually, our connection pools are configured to open all the connections on startup (initial capacity = max capacity), so it never opens more connections under load. Regarding SQL*Net wait - I still don't understand how it would cause high CPU and make the DB unresponsive... – dkarlinsky Dec 27 '12 at 14:41
  • I've run into a similar issue, but the stuck connection sticks around even after the client is gone... – Lukas Eder Jun 10 '15 at 12:14

I've encountered a similar issue, which I've documented here: Unkillable Oracle session waiting on "SQL*Net message from client" event. In my case, the problem was caused by a bind variable of type CLOB that was bound to a place where CLOBs seem to cause severe issues in Oracle. The following statement produces the same behaviour as you've observed:

CREATE TABLE t (
  v INT, 
  s VARCHAR2(400 CHAR)
);

var v_s varchar2(50)
exec :v_s := 'abc'

MERGE INTO t                      
USING (
  SELECT 
    1 v, 
    CAST(:v_s AS CLOB) s 
  FROM DUAL
) s 
ON (t.s = s.s) -- Using a CLOB here causes the bug.
WHEN MATCHED THEN UPDATE SET
  t.v = s.v        
WHEN NOT MATCHED THEN INSERT (v, s) 
VALUES (s.v, s.s);

There are probably other statements besides MERGE that expose this behaviour and produce zombie sessions as well, since Oracle appears to enter some kind of infinite loop, which would explain the observed CPU load.
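As a workaround sketch (assuming the bound value always fits in the column's declared length), binding the value as VARCHAR2 instead of CLOB keeps the ON clause comparing two VARCHAR2 values and sidesteps the hang:

```sql
-- Workaround sketch: cast the bind to VARCHAR2 rather than CLOB, so the
-- ON clause compares VARCHAR2 to VARCHAR2. Assumes :v_s always fits
-- within t.s's VARCHAR2(400 CHAR) definition.
MERGE INTO t
USING (
  SELECT
    1 v,
    CAST(:v_s AS VARCHAR2(400 CHAR)) s
  FROM DUAL
) s
ON (t.s = s.s)
WHEN MATCHED THEN UPDATE SET
  t.v = s.v
WHEN NOT MATCHED THEN INSERT (v, s)
VALUES (s.v, s.s);
```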

Lukas Eder