0

I am trying to figure out why our Oracle 11g has been misbehaving. We are getting the following error in all our jobs which access the database with JDBC:

Error message: java.sql.SQLRecoverableException: IO Error: Connection reset

Before anyone gets upset, yes, I have checked the other related links on stack overflow (and goodness only knows how many other sites). Nothing so far has helped and I'm wondering what I could try next.

The situation has worsened over the past few months and the only thing which has changed is that the jobs which access the database are running in docker containers. The only affected database is the oracle 11g (we have several other databases including oracle 12c). I can access the 11g database using sqlplus without difficulty.

Since this is a connectivity thing, I naturally checked the files tnsnames.ora and listener.ora. Here is what they contain (the database in question is SID=TMF). tnsnames.ora:

GLOB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = linuxoracle11)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = GLOB)
    )
  )

STATIC =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = linuxoracle11)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = STATIC)
    )
  )

TMF =
  (DESCRIPTION =
    (SDU=2048)
    (TDU=2048)
    (ADDRESS = (PROTOCOL = TCP)(HOST = linuxoracle11)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = SHARED)
      (SERVICE_NAME = TMF)
    )
  )

DEMO =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = linuxoracle11)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DEMO)
    )
  )

HONDADB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = linuxoracle11)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = HONDADB)
    )
  )

TEMP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = linuxoracle11)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = TEMP)
    )
  )

EXTPROC_CONNECTION_DATA =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1))
    )
    (CONNECT_DATA =
      (SID = PLSExtProc)
      (PRESENTATION = RO)
    )
  )

listener.ora:

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = TMF)
      (ORACLE_HOME = /home/oracle/app/oracle/product/11.2.0/dbhome_1)
      (SID_NAME = TMF)
    )

    (SID_DESC =
      (GLOBAL_DBNAME = STATIC)
      (ORACLE_HOME = /home/oracle/app/oracle/product/11.2.0/dbhome_1)
      (SID_NAME = STATIC)
    )

    (SID_DESC =
      (GLOBAL_DBNAME = DEMO)
      (ORACLE_HOME = /home/oracle/app/oracle/product/11.2.0/dbhome_1)
      (SID_NAME = DEMO)
    )
  )


LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = linuxoracle11)(PORT = 1521))
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521))
    )
  )

For the life of me I cannot see what is wrong. The listener trace log has reams of entries like these:

09-JUN-2020 18:25:21 * service_died * TMF * 12537
09-JUN-2020 18:25:22 * service_register * TMF * 0
09-JUN-2020 18:25:27 * service_died * TMF * 12537
09-JUN-2020 18:25:28 * service_register * TMF * 0
Tue Jun 09 18:25:33 2020
09-JUN-2020 18:25:33 * service_died * TMF * 12537
09-JUN-2020 18:25:34 * service_register * TMF * 0
09-JUN-2020 18:25:39 * service_died * TMF * 12537
09-JUN-2020 18:25:40 * service_register * TMF * 0
Tue Jun 09 18:25:45 2020
09-JUN-2020 18:25:45 * service_died * TMF * 12537
09-JUN-2020 18:25:46 * service_register * TMF * 0
09-JUN-2020 18:25:51 * service_died * TMF * 12537
09-JUN-2020 18:25:52 * service_register * TMF * 0
Tue Jun 09 18:25:57 2020
09-JUN-2020 18:25:57 * service_died * TMF * 12537
09-JUN-2020 18:25:58 * service_register * TMF * 0
09-JUN-2020 18:26:03 * service_died * TMF * 12537
09-JUN-2020 18:26:04 * service_register * TMF * 0

My most recent attempt to fix this is a slight variation on the information in this oracle community post. The setup on the server:

[root@linuxoracle11 trace]# mv /dev/random /dev/xrandom
[root@linuxoracle11 trace]# ln -s /dev/urandom /dev/random
[root@linuxoracle11 trace]# ls -l /dev/*ndom
lrwxrwxrwx 1 root root   12 Jun  9 19:26 /dev/random -> /dev/urandom
crw-rw-rw- 1 root root 1, 9 Jun  9 09:58 /dev/urandom
crw-rw-rw- 1 root root 1, 8 Jun  9 09:58 /dev/xrandom

This did not, however, solve the problem and the jobs running against the database are still aborting.

The most irritating aspect is that sometimes the jobs run through without error. I'd be grateful f anyone has a hint in which direction I should look.

Edit: As requested, here is the content of the sqlnet.ora file:

SQLNET.RECV_TIMEOUT=5
SQLNET.INBOUND_CONNECT_TIMEOUT=5

Yes, that is the default generated when the database system was installed.

There is also the suggestion that it might be due to a "logon storm". That is doubtful, since we have few actual users on the database. Mostly it's those jobs I wrote about running in docker containers. I checked the oracle doco and some info (again from a Burleson Consulting page) and tried the following:

SQL> select resource_name, current_utilization, limit_value
  2  from v_$resource_limit
  3  where resource_name in ('processes', 'sessions');

RESOURCE_NAME                  CURRENT_UTILIZATION LIMIT_VALU
------------------------------ ------------------- ----------
processes                      25                  200
sessions                       30                  322

So we're talking about a couple of dozen connections, not hundreds.

Edit2: For the sake of completeness, here is a part of the stack trace:

****** Jun 12, 2020 9:20:32 PM IO Error: Connection reset ******
java.sql.SQLRecoverableException: IO Error: Connection reset
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:752)
    at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:666)
    at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:566)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:247)
...

Caused by: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
    at oracle.net.ns.DataPacket.send(DataPacket.java:209)
    at oracle.net.ns.NetOutputStream.flush(NetOutputStream.java:215)
    at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:302)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:249)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:171)
    at oracle.net.ns.NetInputStream.read(NetInputStream.java:89)
    at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:123)
    at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:79)
    at oracle.jdbc.driver.T4CMAREngineStream.unmarshalUB1(T4CMAREngineStream.java:429)
    at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:397)
    at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
    at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:437)
    at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:954)
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:639)
    ... 38 more
Achim Schmitz
  • 498
  • 1
  • 7
  • 18
  • When it works, how long does it take - less than 5 seconds? Can you include your `sqlnet.ora` file tool does it have `sent_timeout` and `recv_timeout` set (to 5, presumably)? – Alex Poole Jun 10 '20 at 11:45
  • I've added the content of sqlnet.ora to the post. As for how long it takes, well, we're talking about how long it normally takes to connect to the database. The jobs I wrote about create jdbc connections at different times for reading data, reading catalog data, updating tables, and so on. Why it should work sometimes and sometimes not, that's what I need to figure out. – Achim Schmitz Jun 10 '20 at 15:44
  • Whether it works or not depends on whether the JDBC call coincides with a `service_died` message, presumably (again!). Comparing logs should give an indication of that. The shorter the call the more likely it is to get away with it; an operation that lasts more than five seconds would always fail, I think. As for SQL\*Plus, is that a local bequeath connection - that doesn't rely on SQL\*Net and so won't be affected? Do you see an issue if you connect via a service name - e.g. `sqlplus usr/pwd@//linuxoracle11:1521/TMF` ? – Alex Poole Jun 10 '20 at 16:42
  • @Alex Poole: compare logs? Which ones? This is a problem for me, because Oracle squirrels logs and alerts away in all sorts of places. I usually need to do a complex "find" to have a chance of finding one. Your surmise about sqlplus is correct. It runs on the server and therefore isn't affected by sqlnet issues. – Achim Schmitz Jun 11 '20 at 08:13
  • The listener trace log you already have, with the `service_died` message and timestamp; and your application log showing the JDBC errors with their timestamps (and hopefully some info just before that). When you get a JDBC error does it coincide with a `service_died`? When you *don't* get a JDBC error does the timing show that is happening between a `service_registered` and `service_died`? I'm pretty sure that's the case, checking the logs will just confirm really. – Alex Poole Jun 11 '20 at 08:17
  • @Alex Poole: I suspect you are right, but as some governments say "we can neither confirm nor deny...", because our job log timestamps are only accurate to the minute, not to the second. The listener trace shows the service dying every few seconds. What would cause that? Some sites I've read suggest that this message is "only informational", but that doesn't seem right. We are currently trying the suggestions in your answer and that of ibre5041. If something works I'll be sure to mention it. – Achim Schmitz Jun 11 '20 at 08:34
  • My answer says what can cause that \*8-) Though you need a support contract to be able to see the details. It isn't every *few* seconds, it's every *five* seconds, which matches the `sqlnet.ora` setting. Hopefully changing `sqlnet.ora` and bouncing the DB will stop the messages and cure the JDBC issue. Good luck. – Alex Poole Jun 11 '20 at 08:39

2 Answers2

1

here is the content of the sqlnet.ora file:

SQLNET.RECV_TIMEOUT=5
SQLNET.INBOUND_CONNECT_TIMEOUT=5

This seems to be the problem. The log shows the service_died messages five seconds after service_register, repeatedly. The service is detaching from and reattaching to the over and over. Your JDBC calls work if they happen to start and end between a register/died pair; and fail if they start but haven't finished when the service_died is reported.

According to the documentation it shouldn't really be a problem - they show these values set to 3, after all.

It seems to be a known problem though; if you have access to My Oracle Support then look at Doc ID 2162994.1, which suggests increasing those setting to be greater than 60 or removing them altogether - and you might need to bounce the database to clean it up. (If you don't have access to MoS you can see the public outline, which doesn't tell you much but is similar to what you see in your listener log.)

There are other issues, searching in MoS reveals Doc IDs 1206583.1 and others, so there could be a bug you can resolve or patch - it depends a bit on exactly which version and patch level of Oracle you are using. If in doubt you can raise a service request, of course; assuming you have a support contract.

Community
  • 1
  • 1
Alex Poole
  • 183,384
  • 11
  • 179
  • 318
  • Setting the timeout to 60 or so would possibly fix things for most situations, but in our latest planning and review meeting we decided to retire the 11g and set up a 19c instead. BTW: We do not have a support contract. – Achim Schmitz Jun 17 '20 at 16:03
  • After setting up a replacement 19c database, I decided to try going through the settings on the 11g again. In the end I simply removed `SQLNET.RECV_TIMEOUT` setting in sqlnet.ora altogehter. That has apparently fixed the problem. I can only suggest that anyone else in this position try the same. – Achim Schmitz Jul 15 '20 at 12:43
0

You should provide full stack trace When the stack trace contains:

at oracle.jdbc.driver.T4CConnection.logon

Then connection was reset by the database, during authentication. (Not by any networking device on the way). The root cause issue of such a situation (logon storm) is actually a problem on client's side. Due to lack of random numbers the client was not able to authenticate itself fast enough. When this happens then you can:

  • Tune kernel parameters to hold more random numbers in buffer(client)
  • Use that trick where JVM uses another /dev/*random device(client)
  • Change parameter SQLNET.INBOUND_CONNECT_TIMEOUT to extend the time window when servers waits for the client(db server)

But there can be other reasons why you do receive this error.

PS: JVM has explicit protections against 2nd "solution" (Java bug 6202721). As it compromises security. So if you set "java.security.egd=file:/dev/urandom" it will intentionally ignore you. While this device name is not black listed "java.security.egd=file:/dev/./urandom"

ibre5041
  • 4,903
  • 1
  • 20
  • 35
  • I have changed our docker-compose config to `"java.security.egd=file:///dev/./urandom"`. We had that in the config when we were running without docker, but for unknown reasons it was changed to `"java.security.egd=file:///dev/urandom"` when we moved to docker. Thanks for the tip. I'll be sure to let you know if it works. – Achim Schmitz Jun 11 '20 at 08:22
  • In case of Docker you do share the same /dev/random kernel device with other containers. In order to prove that you really should provide at least a fragment of your Java stacktrace. – ibre5041 Jun 11 '20 at 09:16
  • I have added the stack trace to the description. It doesn't really add much to that which was already known. After a planning and review meeting, we have now decided to replace the oracle 11g with an oracle 19c. This means this problem is no longer relevant. I hope the discussion helps someone else to gain some insights into the nature of this error message. – Achim Schmitz Jun 17 '20 at 15:47
  • According to this https://stackoverflow.com/a/59097932/836215 it looks like it get fixed after upgrading to Java11 (since Java11, JVM uses non-blocking /dev/urandom by default) – ibre5041 Jun 17 '20 at 16:08