We're a MEMS / MOMS developer company with 25 years of experience and an Oracle partnership.
Recently, several distinct services under our main product have been hanging. After some analysis we narrowed the cause down to Oracle DB access.
The problem may not actually be caused by the Oracle driver, but my goal is both to share information with others who might have the same issue and to hear from anyone who has seen it.
We have never been able to reproduce the error on our side, and we are fairly sure it is network related. On one customer's network, further analysis with Wireshark showed occasional loss of TCP packets… Still, as you may have experienced, it is very hard to convince / satisfy your customers. And on the other hand, the error causes our threads to hang and renders the system unreliable.
Nothing in our code has changed, nothing on the customer servers or networks has changed, yet the issue now shows itself several times a day, at several customers.
We took several precautions and actions in the direction of making our DB accesses asynchronous. It helped a little, but only to some extent; some of our DB accesses, if not all, need to be strictly synchronous.
None of the actions we took, including experimenting with pool parameters, helped.
Thus we suddenly have a soft spot in our product. We probably owe it to some tiny, seemingly insignificant Windows / .NET update, yet it is ours to solve…
In my most recent investigation, I managed to debug a process on the customer side while it was still in a frozen state. When I analyzed the stack trace, I saw that the issue really was related to the DB access; specifically, it hung when the Oracle code called Socket.Receive() in the framework (.NET 4.0). Later I investigated two separate occurrences at different customers and, positively(?), found the same tip on the frozen threads.
The Oracle Managed Data Access library used was 4.122.19.1 (2019-11-22).
[1](https://i.stack.imgur.com/JGSi5.png)
[2](https://i.stack.imgur.com/a5pc9.png)
[3](https://i.stack.imgur.com/uCazw.png)
The actions seen in the logs for the last DB executions before the freeze always appear completed (in the DB records). The stack trace, in my opinion, points to the same conclusion anyway: the Oracle code seems to have completed its work, and the connection is perhaps attempting to switch to its next state.
Before venturing deeper and contacting MS, I went ahead and upgraded our project to .NET 4.7.2 and replaced the Oracle Managed Data Access library with the latest 4.122.21.1 (I think it was actually listed as 21.9.0), dated 2023-01-10.
We deployed the code to two of our customers and are still waiting to hear back whether the problem shows itself again.
If anyone has experienced / solved such an issue, any kind of help or insight would be greatly appreciated.
Below I'll list the actions we took and some further details based on new information.
We wrote an extension for non-query executions in order to have control over timeouts, because the driver did not enforce them properly on its own. An async task => wait => throw exception approach was chosen, roughly as sketched below.
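A minimal sketch of that approach, with simplified / hypothetical names rather than our exact production code:

```csharp
using System;
using System.Threading.Tasks;
using Oracle.ManagedDataAccess.Client;

// Hypothetical extension name; a minimal sketch of the "async task => wait => throw" idea.
public static class OracleCommandTimeoutExtensions
{
    // Runs ExecuteNonQuery on a worker thread and gives up after 'timeout',
    // instead of relying on the driver's own CommandTimeout handling.
    public static int ExecuteNonQueryWithTimeout(this OracleCommand command, TimeSpan timeout)
    {
        var task = Task.Run(() => command.ExecuteNonQuery());

        if (!task.Wait(timeout))
        {
            // The underlying call may still be blocked on the socket; we only
            // stop waiting here and surface the problem to the caller.
            try { command.Cancel(); } catch { /* best effort */ }
            throw new TimeoutException(
                $"Non-query execution did not complete within {timeout.TotalSeconds} s.");
        }

        return task.Result;
    }
}
```

Note that this only frees the caller; the worker thread can still be stuck inside the driver, which is probably part of why it only helped to some extent.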
We took actions to utilize the connection pooling feature in all accesses.
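For reference, pooling is driven by the connection string; something along these lines (the values shown are illustrative, not the exact ones we settled on):

```csharp
using Oracle.ManagedDataAccess.Client;

public static class PooledDbAccess
{
    // Illustrative pool settings only; the actual values we experimented with varied per customer.
    private const string ConnectionString =
        "User Id=app_user;Password=***;Data Source=ORCL;" +
        "Pooling=true;Min Pool Size=2;Max Pool Size=50;" +
        "Connection Timeout=30;Connection Lifetime=300;Validate Connection=true";

    public static void DoWork()
    {
        using (var connection = new OracleConnection(ConnectionString))
        {
            connection.Open();   // the connection is taken from / returned to the pool
            // ... execute commands ...
        }
    }
}
```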
We created a wrapper class for all OracleDataReader accesses, made sure a new connection is used for every read, and managed the disposals with close attention (sketched below).
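A simplified sketch of the wrapper idea (hypothetical helper name, not our real class):

```csharp
using System;
using System.Collections.Generic;
using Oracle.ManagedDataAccess.Client;

// One fresh (pooled) connection per read, with every disposable object released deterministically.
public static class DbReaderHelper
{
    public static List<T> ReadAll<T>(string connectionString, string sql,
                                     Func<OracleDataReader, T> map)
    {
        var results = new List<T>();

        using (var connection = new OracleConnection(connectionString))
        using (var command = new OracleCommand(sql, connection))
        {
            connection.Open();

            using (OracleDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                    results.Add(map(reader));
            }
        }   // the connection goes back to the pool here

        return results;
    }
}
```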
The same (new connection) approach was applied to every other access.
We made similar changes to all OracleCommand uses and made sure their disposal is under control as well.
We changed the Oracle Managed Data Access library to the latest version.
We upgraded our .NET version from 4.0 to 4.7.2.
These actions brought success at one of the customers.
But the one with the terrible network subsystem continued to show problems, now with a whole different picture...
Let me try to describe:
Transforming some DB-related code to run asynchronously seems to have worked to some extent. The lock-ups still occur from time to time, yet a lot less frequently, and the call stack now shows a different picture.
The difference in the call stacks: the one on the left is locked, while the two on the right were working properly at the time the captures were taken.
Then I checked the thread-safety elements to make sure no deadlocks were occurring.
And lately we discovered that the system actually continues if we wait long enough, after some serious time: in one case, after roughly 2700 minutes, we detected that the system resumed operating.
The mind-blowing thing is what our latest extreme logging shows us... The flow of the code does not continue after a lock-up. The far end of the call stack seems to have changed without completing the then-current branch.
It is sequential code inside a while loop that performs various operations on a single thread.
The call stack breaks when the lock-up occurs, and after the system unlocks, the next log shows another operation starting from the root of the main call stack. The previous one was never completed.
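To make the symptom concrete, here is a hypothetical sketch of the loop and logging structure; the names are invented and this is only the shape of our code, not the code itself:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical shape of the single-threaded service loop and its synchronous logging.
public class MainLoopSketch
{
    private readonly List<(string Name, Action Execute)> _operations;
    private volatile bool _running = true;

    public MainLoopSketch(List<(string Name, Action Execute)> operations)
    {
        _operations = operations;
    }

    public void Run()
    {
        while (_running)
        {
            foreach (var op in _operations)
            {
                Log($"BEGIN {op.Name}");
                op.Execute();          // the freeze happens somewhere in here
                Log($"END {op.Name}"); // for the frozen operation this entry never
                                       // appears; the next entry in the file is a
                                       // BEGIN written from the root of the loop
            }
        }
    }

    private static void Log(string message)
    {
        // Sequential, synchronous write, as in our real logger.
        File.AppendAllText("service.log", $"{DateTime.Now:O} {message}{Environment.NewLine}");
    }
}
```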
The log writes are sequential and synchronous. No disk errors were found in the Event Viewer, no exceptions were logged, and the system was not restarted. The global exception handler was not hit either.
It seems like the call stack was broken at some point and somehow the system continued under the same root loop. Again, no exceptions were logged.
Then, while I was investigating the issue that occurred 2-3 days ago, I noticed one of the services lock up. I tried to access its web service and saw that I could trigger the same broken-call-stack behaviour several times.
The operation that loads the web service for the web client was interrupted, without any errors, every time I pressed Control + F5 in the browser.
Since the issue is now in the twilight zone category, I believe we're missing something fundamental. Maybe the logs are not working properly and are misleading. Maybe the computer hardware or the operating system is not right in the head. What the heck even causes a scheduler to malfunction?