
We have four LPARs, each running one Java instance.

They do a lot of read/write operations to a shared NFS server. When the NFS server goes down abruptly, all the threads that were trying to read an image on each of these four servers get into a hung state.

The trace below shows this (the process is a WebSphere Application Server process).

  1. While we are working on the issues on the NFS server side, is there a way to avoid this from the code side?

  2. If the underlying connection is TCP-based (which I assume it is), should the TCP read/connect timeout take care of this? Basically, I want the thread to be returned to the pool instead of waiting indefinitely for the other side to respond.

  3. Or is this something that should be handled by the NFS 'client' on the source machine, i.e. some client-side config setting pertaining to NFS (since FileInputStream.open would not know whether the file it is trying to read is on the local server or on a shared folder on the NFS server)?

Thanks in advance for your answers :)

We are using Java 1.6 on WAS 7.0.

[8/2/15 19:52:41:219 GST] 00000023 ThreadMonitor W WSVR0605W: Thread "WebContainer : 77" (00003c2b) has been active for 763879 milliseconds and may be hung. There is/are 110 thread(s) in total in the server that may be hung.
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:113)
    at java.io.FileInputStream.<init>(FileInputStream.java:73)
    at org.emarapay.presentation.common.util.ImageServlet.processRequest(Unknown Source)
    at org.emarapay.presentation.common.util.ImageServlet.doGet(Unknown Source)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:718)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:831)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1694)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1635)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:113)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain._doFilter(WebAppFilterChain.java:80)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:908)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:965)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:508)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl

  • This is a limitation of NFS. Nothing you can do about it in Java. – user207421 Aug 03 '15 at 08:02
  • EJP – I do agree this is a limitation on the infrastructure side. However, the code needs to handle this in some way or other. E.g. (though it might not be the right solution) using a FutureTask to read the image and timing out after a few seconds, so that at least the thread is returned to the pool? – Rajarajan Pudupatti Sundari Je Aug 03 '15 at 08:50
  • If the NFS host is completely hung in these cases, TCP keepalive might be able to speed up the detection. Otherwise, TCP is going to wait an eternity because it has no idea what protocol is sitting above it. – covener Aug 03 '15 at 11:25

3 Answers


Check this solution: https://stackoverflow.com/a/9832633/1609655

You can do something similar for reading the image: wrap the read call in a Java Future and cancel it when the operation does not finish within a specified amount of time.

It might not be perfect, but it will at least prevent your server from being stuck forever.
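A minimal sketch of that approach, assuming Java 6 on WAS (so no try-with-resources or lambdas); the class name, pool size, and timeout handling here are illustrative. One caveat worth stating up front: cancelling the Future frees the caller, but the worker thread itself may stay blocked inside FileInputStream until the NFS client gives up, so size the I/O pool accordingly.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.*;

public class TimedImageReader {
    // A separate pool, so a hung NFS read does not consume WebContainer threads.
    private static final ExecutorService IO_POOL = Executors.newFixedThreadPool(10);

    /** Reads a file, giving up after the supplied timeout. */
    public static byte[] readWithTimeout(final String path, long timeoutMillis)
            throws Exception {
        Future<byte[]> task = IO_POOL.submit(new Callable<byte[]>() {
            public byte[] call() throws IOException {
                // This open/read may block indefinitely on a dead hard-mounted NFS share.
                InputStream in = new FileInputStream(path);
                try {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                    return out.toByteArray();
                } finally {
                    in.close();
                }
            }
        });
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt is best-effort: blocked file I/O usually ignores it,
            // but the servlet thread is released back to the pool either way.
            task.cancel(true);
            throw new IOException("Timed out reading " + path, e);
        }
    }
}
```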

– Palanivelrajan

This was the response from @shodanshok on Server Fault, and it helped us.

This probably depends on how the NFS share is mounted. By default, NFS shares are mounted with the "hard" option, meaning that accesses to a non-responding NFS share will block indefinitely.

You can change the client-side mount point, adding one of the following parameters (I'm using the Linux man page here; your specific options may be a little different):

  • soft: if the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.

  • intr: selects whether to allow signals to interrupt file operations on this mount point. Using the intr option is preferred to using the soft option because it is significantly less likely to result in data corruption. FYI, this was deprecated in Linux kernel 2.6.25+.

Source: Linux nfs man page
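For illustration, a soft-mounted entry in /etc/fstab might look like the sketch below; the server name, export path, and timing values are made up, and the timeo/retrans semantics are taken from the Linux nfs(5) man page. As the quoted text warns, soft mounts trade hung threads for a risk of data corruption on interrupted writes.

```
# Illustrative /etc/fstab entry (hostname, paths and values are made up).
# soft:    return an error to the application instead of retrying forever
# timeo:   initial retry timeout, in tenths of a second
# retrans: number of retries before "soft" gives up and errors out
nfsserver:/export/images  /mnt/images  nfs  soft,timeo=50,retrans=3  0  0
```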


http://martinfowler.com/bliki/CircuitBreaker.html

This seems to be the perfect solution for this problem (and similar kinds of problems). The idea is to wrap the call in another object which will prevent further calls to the failed service (based on how you design this object to handle the situation).

E.g. when an external service becomes unresponsive, threads slowly go into a hung state. Instead, it would be good to have a threshold level which prevents threads from getting into that state. For example, we could configure: do not attempt to connect to the external service if it has not responded to (or is still waiting on) the previous 30 requests. In that case the 31st request would immediately return an error to the customer trying to access the report (or send an error mail to the team), but at least the 31st thread WILL NOT BE STUCK waiting; instead it will be used to serve other requests from other functionalities. A rough sketch of such a breaker follows.
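A minimal Java sketch of that idea, with an illustrative consecutive-failure threshold and cooldown window; the class and method names are made up, and real implementations such as Hystrix (linked below) add a proper half-open state and richer failure policies.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Simplified circuit breaker: after "threshold" consecutive failures,
 * calls fail fast for "cooldownMillis" instead of blocking on a dead service.
 */
public class CircuitBreaker {
    private final int threshold;
    private final long cooldownMillis;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);
    private final AtomicLong openedAt = new AtomicLong(0);

    public CircuitBreaker(int threshold, long cooldownMillis) {
        this.threshold = threshold;
        this.cooldownMillis = cooldownMillis;
    }

    /** Throws immediately while the breaker is open, so the thread is never parked. */
    public void beforeCall() {
        if (consecutiveFailures.get() >= threshold
                && System.currentTimeMillis() - openedAt.get() < cooldownMillis) {
            throw new IllegalStateException("Circuit open: NFS share presumed down");
        }
        // Once the cooldown expires, calls are let through again; a success
        // closes the breaker, another failure re-opens it.
    }

    public void onSuccess() {
        consecutiveFailures.set(0); // close the breaker
    }

    public void onFailure() {
        if (consecutiveFailures.incrementAndGet() >= threshold) {
            openedAt.set(System.currentTimeMillis()); // (re)open the breaker
        }
    }
}
```

Each image read would then be bracketed with breaker.beforeCall() and breaker.onSuccess(), with breaker.onFailure() called in the catch block, so the 31st request fails fast instead of parking yet another WebContainer thread.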

References:

http://martinfowler.com/bliki/CircuitBreaker.html

http://doc.akka.io/docs/akka/snapshot/common/circuitbreaker.html

http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

https://github.com/Netflix/Hystrix