
I'm currently running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to gauge the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:

    PBS Job Id: 91487.master.cluster
    Job Name: cmaq_cctm_benchmark_serial.sh
    Exec host: hs012/0
    An error has occurred processing your job, see below.
    Post job file processing error; job 91487.master.cluster on host hs012/0
    Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)

This is very strange because I have never modified my home directory, and "/home/shangxin/" is definitely my permanent directory, where the code lives.

Also, the following message always appears in the standard output .log file when the job fails:

    Bus error 100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w

What does this message mean specifically?

At first I thought the job was exhausting the RAM and this was an out-of-memory problem. However, when I logged into the compute node during a run and checked memory usage with the "free -m" and "htop" commands, I saw that neither RAM nor swap usage ever exceeded 10%, so memory is not the issue.

Because I used "tee" to record the job running to a log file, this file can contain up to tens of thousands of lines and the size is over 1MB. To test whether this standard output overwhelms the cluster system, I ran another same job but without the standard output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is also not the reason.

I also tried running the job in parallel on multiple cores; it still failed with the same error after dozens of hours.

I'm confident the code itself is fine, because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but has not been able to pinpoint the cause so far.

Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!

Shangxin
  • Stale file handle means that an open file was deleted. This can't happen with local files because the kernel doesn't remove the file until all file descriptors are closed. But it can happen with NFS if the file is deleted by the server or a different client, because NFS is stateless and the server doesn't know that clients have a file open. – Barmar Nov 01 '16 at 20:53
  • But getting this error for the home directory is very strange. – Barmar Nov 01 '16 at 20:54

1 Answer


On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
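For example, you can check from a worker node whether your home directory really is NFS-mounted, and see what other network filesystems the node offers (useful for option 1 below). This is just a quick sketch using standard Linux commands; the mount names and types will of course differ from cluster to cluster:

    df -T "$HOME"        # filesystem type of the home directory; "nfs"/"nfs4" means a network mount
    mount | grep -i nfs  # list the NFS mounts visible on this node (scratch space, project space, etc.)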

It sounds like the NFS mount of your home directory on the worker node failed while your job was running. This isn't something you can fix directly unless you have administrative privileges on the cluster. If you need a workaround and cannot wait for the sysadmins to address the problem, you could:

  1. Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS mounts directly under the root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
  2. Have your job work in a temporary directory local to the worker node and write all of its output files and logs there. At the end of the job, copy everything back to your home directory or to the login node (a minimal sketch follows this list). This can be awkward over ssh if your keys live in your home directory, and may require copying keys to the temporary directory, which is generally a bad idea unless you lock them down with file permissions.
  3. Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are flakier than others. Depending on local settings, you may be able to request specific nodes directly, or request resources that are only available on stable nodes. If you can track which nodes are unstable and find your job assigned to one, resubmit the job and then cancel the copy running on the unstable node.
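
As a rough illustration of option 2, here is a minimal PBS job script sketch. The CMAQ directory, the run script name, the "output" directory, and the login-node host are placeholders you would have to replace with your own; whether $TMPDIR is set and whether you can request nodes by name depends on how your cluster is configured:

    #!/bin/bash
    #PBS -N cmaq_cctm_benchmark_serial
    #PBS -l nodes=1:ppn=1
    # Some PBS/Torque setups also accept a specific hostname, e.g. "-l nodes=hs013:ppn=1"
    # (option 3); ask your admins whether that is allowed here.

    # Work on node-local storage instead of the NFS-mounted home directory.
    # $TMPDIR is often set by the scheduler; /tmp is a common fallback.
    WORKDIR="${TMPDIR:-/tmp}/cmaq_${PBS_JOBID}"
    mkdir -p "$WORKDIR"
    cd "$WORKDIR" || exit 1

    # Stage the benchmark inputs and run script from home once, up front
    # ("CMAQ_benchmark" and "run_cctm.csh" are placeholder names).
    cp -r "$HOME/CMAQ_benchmark" .
    cd CMAQ_benchmark || exit 1

    # Run the model, keeping the log on local disk rather than on NFS.
    ./run_cctm.csh > cctm_run.log 2>&1

    # Copy results back to the home directory; if the home mount on this
    # worker proves unreliable, fall back to scp-ing them to the login node
    # ("login-node" and the "output" directory are placeholders).
    RESULTS="$HOME/cmaq_results_${PBS_JOBID}"
    mkdir -p "$RESULTS" && cp -r cctm_run.log output "$RESULTS"/ || \
        scp -r cctm_run.log output login-node:cmaq_results_${PBS_JOBID}/

Note that ssh keys only come into play if the scp fallback is needed; if the plain cp back to $HOME succeeds, no keys have to be copied anywhere.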

The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.

JosiahJohnston