The `stat()` system call takes a long time when I call it on a corrupted file (its magic number is corrupted). A print statement placed right after the call in my source code appears only after a noticeable delay. I am not sure whether `stat()` retries internally. If any documentation on this is available, please share it; it would be a great help.

The call failed with an input/output error (errno 5, `EIO`), so I am not sure whether the file or the filesystem is corrupted.
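One way to put a number on the delay and capture the error code is to wrap the call with a monotonic clock. This is a minimal sketch, not something from the original post: the helper name `timed_stat` is hypothetical, and it assumes a POSIX system where `clock_gettime(CLOCK_MONOTONIC, ...)` is available.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() and CLOCK_MONOTONIC */
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper (not part of any library): calls stat() once on
 * `path`, stores the elapsed wall-clock time in milliseconds in
 * *elapsed_ms, and returns 0 on success or the errno value
 * (for example EIO or ENOENT) on failure. */
int timed_stat(const char *path, struct stat *st, double *elapsed_ms)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = stat(path, st);
    int saved_errno = errno;  /* save errno before another call can change it */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *elapsed_ms = (double)(t1.tv_sec - t0.tv_sec) * 1e3
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;

    return rc == 0 ? 0 : saved_errno;
}
```

Passing a non-zero return value to `strerror()` gives the readable message, and comparing `elapsed_ms` for the suspect file against a healthy file on the same disk shows whether the delay is specific to that file.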

  • The [`stat`](http://man7.org/linux/man-pages/man2/stat.2.html) system call doesn't read the actual file data, so whether the file data is corrupted doesn't matter. It can matter if the *disk* is corrupt, though. Check the system log and what e.g. [`dmesg`](http://man7.org/linux/man-pages/man1/dmesg.1.html) shows. – Some programmer dude Jul 21 '14 at 11:17
  • Refer to the `stat` man page for more information about `stat()`. – Sathish Jul 21 '14 at 11:19
  • If you get `EIO`, it probably means that the disk is corrupted or broken, the disk controller is broken, or something else in your computer is broken. You should back up as much as you can and then search for the problem; how to do that is off-topic for SO. If you need help, go to http://superuser.com/. – Some programmer dude Jul 21 '14 at 11:21
  • I got it, Joachim, but my question is: will this add a substantial delay before the call returns? – Abhinand Pl Jul 21 '14 at 11:24
  • If there's something wrong with the disk it can indeed delay *all* operations on that disk. It might even be possible that it can become more corrupted the more you access it. – Some programmer dude Jul 21 '14 at 11:28
  • Put your disk in the freezer for an hour, then `dd` it to another drive before you lose _all_ of your data. Putting it in the freezer for an hour is less of a delay than sending it off to be recovered. – technosaurus Jul 21 '14 at 11:38
  • I know this is an old question, but I am curious as to what you mean by a "long" time? Typically, my calls to `stat` take 1 ms or less, but sometimes it goes to 90 ms, which seems like a long time to me in a time sensitive application. – Mark Lakata Jun 01 '23 at 18:24

1 Answer

This can be caused by bad blocks on an aging or damaged spinning disk. There are two other symptoms that will likely occur concurrently:

  • Copious explicit I/O errors reported by the kernel in the system logs.

  • A sudden spike in load average. This happens because processes that are stuck waiting on I/O are in uninterruptible sleep while the kernel busy loops in an attempt to interact with the hardware, causing the system to become sluggish temporarily. You cannot stop this from happening, or kill processes in uninterruptible sleep. It's a sort of OS Achilles' heel.
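To check whether processes really are stuck in uninterruptible sleep, the process state can be read from `/proc`. The sketch below is an illustration only: it assumes Linux with procfs mounted at `/proc`, and the helper names `parse_proc_state` and `count_uninterruptible` are hypothetical. In `/proc/<pid>/stat` the state letter follows the parenthesized command name, and `D` means uninterruptible sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given the text of /proc/<pid>/stat, returns the
 * state character ('R', 'S', 'D', ...) or '\0' if it cannot be found.
 * The command name may itself contain spaces or parentheses, so the
 * search starts from the LAST ')' in the line. */
char parse_proc_state(const char *stat_line)
{
    const char *close = strrchr(stat_line, ')');
    if (close == NULL || close[1] != ' ')
        return '\0';
    return close[2];
}

/* Hypothetical helper: prints the PID of every process currently in
 * state 'D' and returns how many were found, or -1 if /proc cannot
 * be opened. */
int count_uninterruptible(void)
{
    DIR *proc = opendir("/proc");
    if (proc == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(proc)) != NULL) {
        /* Only numeric directory names are process IDs. */
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;

        char path[288], line[512];
        snprintf(path, sizeof path, "/proc/%s/stat", ent->d_name);

        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;  /* the process may have exited in the meantime */

        if (fgets(line, sizeof line, f) != NULL &&
            parse_proc_state(line) == 'D') {
            printf("pid %s is in uninterruptible sleep\n", ent->d_name);
            count++;
        }
        fclose(f);
    }
    closedir(proc);
    return count;
}
```

A count that stays above zero while the kernel log fills with I/O errors for the same device would be consistent with the bad-block scenario described here.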

If this is the case, unmount the filesystems involved and run `e2fsck -c -y` on them (this applies to ext2/ext3/ext4 filesystems). If it is the root filesystem, you will need to boot the system from, e.g., a live CD and do it from there. From `man e2fsck`:

-c

This option causes e2fsck to use badblocks(8) program to do a read-only scan of the device in order to find any bad blocks. If any bad blocks are found, they are added to the bad block inode to prevent them from being allocated to a file or directory. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.

Note that `-cc` takes a long time; `-c` should be sufficient. `-y` answers yes automatically to all questions, which you might as well do since there may be a lot of them.

You will probably lose some data (have a look in the `lost+found` directory at the root of each repaired filesystem afterward); hopefully the system still boots. At the very least, the filesystems are now safe to mount. The disk itself may or may not last a while longer. I've done this and had disks remain fine for months more, but don't count on it.

If the drive supports SMART, there are other tools, such as `smartctl` from smartmontools, that you can use to diagnose the same problem, although what I've outlined here is probably good enough.

Community
CodeClown42