This can be caused by bad blocks on an aging or damaged spinning disk. There are two other symptoms that will likely occur concurrently:
Copious explicit I/O errors reported by the kernel in the system logs.
A sudden spike in load average. This happens because processes stuck waiting on I/O sit in uninterruptible sleep (state D), which Linux counts toward the load average, while the kernel repeatedly retries the failing hardware, making the system sluggish temporarily. You cannot stop this from happening, nor can you kill processes in uninterruptible sleep. It's a sort of OS Achilles' heel.
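A rough way to check for both symptoms from a shell is sketched below. The helper names io_error_lines and d_state_procs are made up for this example, and the grep pattern is an assumption: the exact wording of kernel I/O error messages varies by kernel version and driver.

```shell
#!/bin/sh
# Keep only kernel log lines that look like disk I/O errors.
# The pattern is a guess at common wording, not an exhaustive list.
io_error_lines() {
    grep -iE 'i/o error|medium error|unrecovered read error'
}

# From "ps -eo pid,stat,comm" output, print the PID and command name of
# every process in uninterruptible sleep (STAT starting with D).
# Only the first word of the command name is printed.
d_state_procs() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Typical use (reading the kernel log may require root on some systems):
#   dmesg | io_error_lines
#   ps -eo pid,stat,comm | d_state_procs
#   uptime    # shows the 1, 5 and 15 minute load averages
```

If the first command prints many lines naming the same device and the second shows processes stuck in state D, that is consistent with the failure described here.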
If this is the case, unmount the filesystems involved and run e2fsck -c -y on them. If it is the root filesystem, you will need to, e.g., boot the system with a live CD and do it from there. From man e2fsck:
-c
This option causes e2fsck to use badblocks(8) program to do a read-only scan of the device in
order to find any bad blocks. If any bad blocks are found, they are added to the bad block
inode to prevent them from being allocated to a file or directory. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.
Note that -cc takes a long time; -c should be sufficient. -y answers yes automatically to all questions, which you might as well do since there may be a lot of those.
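The procedure above could be wrapped roughly like this. It is only a sketch: is_mounted and check_fs are hypothetical helper names, the device path in the usage line is a placeholder, and the mount check assumes a Linux system with /proc/mounts. It compares device names literally, so a filesystem mounted under another name for the same device (e.g. a /dev/mapper path) would not be detected.

```shell
#!/bin/sh
# is_mounted DEV [MOUNTS_FILE]
# Succeeds if DEV appears as a source device in the mounts table.
# MOUNTS_FILE defaults to /proc/mounts (Linux-specific assumption).
is_mounted() {
    awk -v d="$1" '$1 == d { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# check_fs DEV
# Refuses to touch a mounted filesystem, otherwise runs the scan.
check_fs() {
    dev=$1
    if is_mounted "$dev"; then
        echo "$dev is still mounted; unmount it first" >&2
        return 1
    fi
    # -c: read-only badblocks scan, bad blocks go into the bad block inode
    # -y: answer yes to every question
    e2fsck -c -y "$dev"
}

# Usage, as root, with the real partition substituted for the placeholder:
#   check_fs /dev/sdXN
```

The mount check is there because running e2fsck on a mounted filesystem can cause further damage, which is why the root filesystem has to be handled from a live system.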
You will probably lose some data (have a look in /lost+found afterward); hopefully the system still boots. At the very least, the filesystems are now safe to mount. The disk itself may or may not last a while longer. I've done this and had disks remain fine for months afterward, but don't count on it.
If this is a SMART drive, there are apparently some other tools you can use to diagnose and deal with the same problem, although what I've outlined here is probably good enough.
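As one possibility, and this is an assumption since no specific tool is named above, the smartctl utility from the smartmontools package can report drive health. The sketch below filters its attribute table for the three counters usually associated with bad sectors. The function name smart_bad_sector_attrs is made up for this example, and it assumes the usual ATA attribute table layout with the raw value in the tenth column. NVMe and some other drives report differently.

```shell
#!/bin/sh
# From "smartctl -A" output, print the sector-related attributes whose
# raw value (10th column, assumed layout) is greater than zero.
smart_bad_sector_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Usage, as root, with the real disk substituted for the placeholder:
#   smartctl -H /dev/sdX                           # overall health verdict
#   smartctl -A /dev/sdX | smart_bad_sector_attrs  # non-zero bad-sector counters
```

No output from the filter means none of those counters is above zero. It does not by itself prove the disk is healthy.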