I'm trying to read data from a faulty external SSD to create an image for data recovery. The drive is an Apacer Panther SSD connected to a USB port via an ICY BOX SATA to USB connector on Ubuntu.

When I run the MWE below, read hangs at some address. The address is mostly stable between consecutive runs, but it can vary (e.g. on different days). With a block size of 1, read hangs on the first byte of some sector. The program then freezes and no signal interrupts the read: Ctrl-C simply prints "^C" to the terminal but does not kill the program, and the alarm's handler is never called.

If I close the terminal and re-run the program in a new terminal, no read completes (it hangs on the first iteration). Only by disconnecting and reconnecting the SSD can I read from the disk again. However, if I disconnect the drive while read is blocked, the program continues.

If I modify the program to use stdin as the file descriptor, both SIGINT and SIGALRM interrupt read.

So the questions are: a) Why does read block indefinitely when, according to the man page, it is interrupted by signals? b) Is there any way to fix this?

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>
#include <errno.h>
#include <signal.h>

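void sig_handler(int signum){
    // printf() is not async-signal-safe, so use write() inside the handler
    (void)signum;
    const char msg[] = "Alarm handler\n";
    write(STDOUT_FILENO, msg, sizeof msg - 1);
}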

int main(int argc, char *argv[]) {

    // Register ALARM signal handler to prevent read() from blocking indefinitely
    struct sigaction alarm_int_handler = {.sa_handler=sig_handler};
    sigaction(SIGALRM, &alarm_int_handler, 0);
    
    char* disk_name = "/dev/sdb";
    const int block_size = 512;
    int offset = 0;
    
    char block[block_size];

    // Open disk to read as binary file via file descriptor
    int fd = open(disk_name, O_RDONLY | O_NONBLOCK);
    if (fd == -1){
        perror(disk_name);
        exit(EXIT_FAILURE);
    }

    int i;
    int position = offset;

    for (i=0; i<100000; i++){

        // Reset alarm to 1 sec (to interrupt blocked read)
        alarm(1);

        // Seek to current position
        int seek_pos = lseek(fd, position, SEEK_SET);
        if (seek_pos == -1){
            perror("Seek");
        }

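        printf("Reading... ");
        fflush(stdout);
        ssize_t len = read(fd, block, block_size);
        int read_errno = errno;  // save errno before printf() can overwrite it
        printf("Read %zd chars at %d\n", len, position);

        if (len == -1){
            if (read_errno != EINTR){
                errno = read_errno;
                perror("Read");
            }
            else {
                printf("Read aborted due to interrupt\n");
                // TODO: handle it
            }
            continue;  // do not move the position back by 1 on a failed read
        }

        position += len;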
        
    }

    close(fd);

    printf("Position %d (%d)\n", position, i * block_size);
    printf("Done\n");
    return 0;
}

Output on the terminal looks like this:

.
.
.
Reading... Read 1 chars at 29642749
Reading... Read 1 chars at 29642750
Reading... Read 1 chars at 29642751
Reading...
Alex P
  • Not an answer to your question, but [ddrescue](https://www.gnu.org/software/ddrescue/) is an excellent tool for what you are trying to achieve. – pmacfarlane Aug 30 '23 at 12:25
  • What happens if you increase the block size to something like 512 or 4096 or so? Reading one byte at a time will not increase the chances of data recovery, because data is transmitted from SSD to RAM in larger blocks anyway. It will just slow things down to a crawl because you need a billion system calls to read a single gigabyte of data. – Fritz Aug 30 '23 at 12:31
  • @pmacfarlane I did not know about ddrescue, I'll be sure to check it out. But I did use dd at first (not-reinventing-the-wheel logic) and it also hung at similar addresses. – Alex P Aug 30 '23 at 12:32
  • @Fritz Of course I'm not trying to create the image with a block size of 1. This is just to take a finer-grain look into what happens. Larger block sizes produce the same results (of course we can't talk about "first sector byte" in these cases though). – Alex P Aug 30 '23 at 12:34
  • I've had to do such recovery before. I wrote a custom program that would read a disk and retry a bad block a few thousand times. The program would do any kind of device reset if needed. Then, after the ordeal was complete (and I had my data back), I heard about `ddrescue` [as pmacfarlane mentioned]. Note that `ddrescue` is _not_ just a simple `dd` clone. It has massive amounts of retry/recovery logic in it (e.g. IIRC, it reads sectors/tracks in forward and reverse order, etc.). – Craig Estey Aug 30 '23 at 22:49
  • If a controller/card is freezing up, you may want to get a special board that can selectively reset/power-cycle the locked-up H/W when your program detects the lockup. That could be the flash memory, the USB adapter, or the SATA-to-USB bridge. I assume the adapter has separate power from an AC source (i.e. it does _not_ just depend upon power from the USB cable on your PC). Or, if you have a desktop PC lying around [or score one on eBay], you may want to plug the SSD directly into a SATA disk adapter card (i.e. bypassing the SATA-to-USB step entirely). – Craig Estey Aug 30 '23 at 23:00
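
As a rough illustration of the "retry a bad block many times" approach mentioned in the comments above, here is a minimal sketch in C. It is not the commenter's program and not how `ddrescue` works; the function name `pread_with_retry` and the `max_tries` parameter are made up for this example. It only helps when a read returns an error (such as EIO); it cannot rescue a read that never returns because the process is stuck in uninterruptible sleep.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: try to read `len` bytes at `offset`, retrying up to
// `max_tries` times when pread() fails (e.g. with EIO or EINTR).
// Returns the byte count of the first successful pread(), or -1 with errno
// set by the last failed attempt.
ssize_t pread_with_retry(int fd, void *buf, size_t len, off_t offset, int max_tries)
{
    assert(max_tries > 0);
    ssize_t got = -1;
    for (int attempt = 0; attempt < max_tries; attempt++) {
        got = pread(fd, buf, len, offset);
        if (got >= 0)
            return got;            // success (0 means end of file)
        if (errno == EBADF || errno == EINVAL)
            break;                 // caller errors: retrying cannot help
    }
    return got;
}
```

Using `pread` instead of `lseek` followed by `read` keeps the offset explicit on every attempt, so a failed attempt cannot leave the file position in an unexpected place.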

2 Answers

2

That sounds like your SSD might be defective (fails to respond to a request, e.g. its firmware hangs while trying to recover from corrupt data in flash memory) or the kernel driver has a bug.

As to why the process does not respond to signals: There is a process state called "uninterruptible sleep" (abbreviated as state D in top and htop). Processes go into this state when they block inside the kernel (i.e. during a system call like read) on something that cannot be interrupted, for example waiting for data from a disk or network (NFS mounts are infamous for this during a network outage). While a process is in this state, signals are not delivered to it; they stay pending until the sleep ends, which is why neither Ctrl-C nor the alarm has any effect. If your SSD does not reply to a data request, then the process would wait for data indefinitely, since the kernel will not ask the SSD a second time. Or maybe it does, and the SSD always refuses to answer, or might even time out after a few hours of trying... who knows.
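
To check for state D from a program rather than from htop, one option on Linux is to read the state letter from `/proc/<pid>/stat`. The following is a minimal sketch under that assumption; the function names `parse_proc_state` and `read_proc_state` are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

// Extract the state letter from one line of /proc/<pid>/stat.
// The command name (field 2) is wrapped in parentheses and may itself
// contain spaces or ')', so look for the LAST ')' and take the first
// non-space character after it. Returns '\0' on a malformed line.
char parse_proc_state(const char *stat_line)
{
    assert(stat_line != NULL);
    const char *p = strrchr(stat_line, ')');
    if (p == NULL)
        return '\0';
    p++;
    while (*p == ' ')
        p++;
    return *p;  // '\0' if nothing follows the command name
}

// Read /proc/<pid>/stat and return the state letter, or '\0' on error.
char read_proc_state(pid_t pid)
{
    char path[64];
    char line[1024];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '\0';
    char state = '\0';
    if (fgets(line, sizeof line, f) != NULL)
        state = parse_proc_state(line);
    fclose(f);
    return state;
}
```

A separate monitoring process could call `read_proc_state` on the reader's PID and treat a return value of `'D'` that persists for several seconds as a sign that the read is stuck.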

Fritz
  • With htop on a different terminal, the program does indeed go into the D state when it hangs. – Alex P Aug 30 '23 at 12:40
  • @AlexP Well, that settles it then, I guess. If the device never recovers, then there is no way other than skipping the address ranges that are known to make the SSD hang. If I remember correctly, `ddrescue` has such a feature, i.e. reading a list of problematic sectors from a file, or similar. – Fritz Aug 30 '23 at 12:53
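
The idea of skipping address ranges that are known to hang the drive could be sketched as follows in a custom reader. This is only an illustration with a hypothetical in-memory range list (`struct bad_range`, `next_readable_offset`); it says nothing about the file format or options `ddrescue` actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

// Hypothetical description of a byte range to avoid: [start, end).
struct bad_range {
    off_t start;
    off_t end;   // exclusive
};

// Return the first offset >= pos that lies outside every range in `bad`.
// The ranges need not be sorted or disjoint: the scan repeats until no
// range contains pos any more.
off_t next_readable_offset(off_t pos, const struct bad_range *bad, size_t n)
{
    assert(bad != NULL || n == 0);
    int moved;
    do {
        moved = 0;
        for (size_t i = 0; i < n; i++) {
            if (pos >= bad[i].start && pos < bad[i].end) {
                pos = bad[i].end;   // jump past this range
                moved = 1;
            }
        }
    } while (moved);
    return pos;
}
```

The read loop would call this before each seek. It would also have to shorten any read that starts before a bad range but would extend into it, which this sketch leaves out.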
-1

It might be a kernel driver bug.

Did you try non-blocking reads? Regular files cannot be polled, but the descriptor can still be made non-blocking.

rostamn739
  • I am using non-blocking reads, since I open the file with the O_NONBLOCK flag. – Alex P Aug 30 '23 at 12:29
  • Do we really expect a bug in the kernel driver when a generic SSD is in use? Countless devices are on the market and running quite fine with Linux. So why would OP find such a reproducible bug by simply reading a single block? That would also happen on every access from the upper file system layers... – Klaus Aug 30 '23 at 12:33