
I have started working on a simple C program to scan for image files (BMP, JPEG, etc.) on a USB disk. I have completed the header files that will contain the image metadata.

My questions are about scanning the USB drive: how will the program know when it reaches the end of the file? I am treating the USB drive like a file, and I plan to read the raw data bytes using fread.

FILE *usb_ptr = fopen(argv[1],"rb");   // pointer to FILE; binary mode for raw data
if(usb_ptr == NULL){
    printf("error opening USB Drive for reading\n");
    return 1;                          // nothing to close when fopen failed
}
  //I manually give the mount location, on fedora usb drives are mounted at
  //         /run/media/user1/USBDRIVE by default

struct header1 header1;
struct header2 header2;
struct colours colours;
int file_count=0;

fread(&header1,sizeof(header1),1,usb_ptr);
fread(&header2,sizeof(header2),1,usb_ptr);

After copying the first few bytes of the USB disk, I perform a check to see whether I have found a BMP file. If it is not a BMP, I scan in the next few bytes, and so on.

 if (header1.signature == 0x4d42 && header1.data_offset == 54 ){
      // signature "BM" and a 54-byte header: count this as a BMP candidate
      int file_size = header1.file_size;
      file_count++;
  //there are more checks trying to keep this post short

1) I plan to iterate this process until I reach the end of the file. But how do I determine when usb_ptr is at the end (i.e. I have finished scanning the USB)?

2) I am pretty sure there will be "EOF" characters in the memory of the USB disk. How do I know for certain that I have reached the end of the disk, rather than just read some random byte on the USB disk?

3) Should I go about this in a different way?

(The code above is not complete, just snippets. There is also another section where I copy the image found on the USB disk to my hard disk. This program is essentially meant to recover images from a drive, and I hope to add more file types later.)

thanks.

tesseract
  • As I understand it, you are not accessing the individual files on your USB (which in itself is just a regular FS) but instead reading "raw" data straight off the chips. But what about file fragmentation? – Jongware Dec 29 '13 at 16:40
  • yes, I am reading raw data straight off the disk, I wasn't aware about fragmentation issues, some info would be useful, anyway doing some googling on it now!! thanks!! – tesseract Dec 29 '13 at 16:49
  • In short: as on any FS, the files on your USB may be fragmented. The FS metadata sorts that out for you when accessing files the usual way. Now, when finding a sector with the correct starting data for a BMP, you cannot rely that the *next* sector(s) belong to the same file. – Jongware Dec 29 '13 at 16:53
  • (Assuming the data is stored sector-wise) .. If you find a valid BMP header: this contains the length of the following data. You could read in that chunk in whole sectors and do a rigorous test if it's valid. If not, the data is not contiguous and you should ignore this BMP header and continue with the next sector. – Jongware Dec 29 '13 at 16:57
  • Okay, no need to test every single byte: "Sectors are 512 bytes long, for compatibility with hard drives" ([wikipedia:USB](http://en.wikipedia.org/wiki/USB_flash_drive)). Read chunks of 512 bytes and test if the first few match a BMP header. If not, skip it entirely. – Jongware Dec 29 '13 at 17:03
  • that's some useful info, I can see my program already flawed,I need to do more reading on this, seems like reading every sector as you say is much more efficient and the way to go about this. – tesseract Dec 29 '13 at 17:22
  • Glad I could help! A final answer to q.2: I presume `fread` will do what it always does at the end of a "file" (in this case, disk) and return the number of read bytes, i.e., most likely "0" (if you are reading per 512 bytes). – Jongware Dec 29 '13 at 17:25

2 Answers


A partial answer:

EOF is not a valid character. There are never any EOF characters inside a file, or on the disk. EOF is a value some functions return when you get to the end of the file. getchar, for example, returns an int, and not a char, for this reason: so that it can return EOF (a negative value, typically -1), which is distinct from every character value it can return. See here for more info.
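
A minimal sketch of that point, assuming an ordinary stdio stream. The helper name `count_ff_bytes` is hypothetical and not from the question. It stores the result of `fgetc` in an `int`, so a data byte of 0xFF stays distinct from `EOF`:

```c
#include <stdio.h>

/* Hypothetical helper: counts how many 0xFF bytes a stream contains.
 * fgetc returns each byte as an unsigned char converted to int (0..255),
 * or EOF (a negative int) once no more data can be read. */
long count_ff_bytes(FILE *fp)
{
    long count = 0;
    int c;                      /* int, not char: it must be able to hold EOF */

    while ((c = fgetc(fp)) != EOF) {
        if (c == 0xFF)          /* an ordinary data byte, not the end of the file */
            count++;
    }
    return count;
}
```

If `c` were declared as `char`, a 0xFF byte could compare equal to `EOF` on platforms where `char` is signed, and the loop would stop early.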

Community
OlivierH
  • thanks brushed up my EOF stuff, think i found what i was looking for in this http://cboard.cprogramming.com/c-programming/96514-what-does-fread-return-end-file.html – tesseract Dec 29 '13 at 16:00

My comments, summarized:

  1. fread will do what it always does at the end of a "file" (in this case, the disk): it returns the number of complete items it read. If you read one 512-byte item per call, that is 0 at the end.

  2. EOF is not a 'byte' value you should be looking for; rather, it indicates a state. Use feof to test for it explicitly after a read comes up short, or just check the return value of fread.

  3. Currently you are checking each and every single byte. But the data is not stored in any random order! USB sticks store data in sectors, each one 512 bytes long: "Sectors are 512 bytes long, for compatibility with hard drives" (wikipedia on USB flash drive).

  4. You cannot assume contiguous sectors belong to the same file, because of fragmentation. If a file is fragmented, there is no automatic way to merge the sectors in the correct order ... (Doing it manually is usually out of the question. I'd consider doing that only if the original file contains easily recognizable data such as plain text, and the contents are extremely important :) .)
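
Points 1 and 2 can be sketched as follows. This is only an illustration under the 512-byte sector assumption quoted above; `count_sectors` and `SECTOR_SIZE` are hypothetical names introduced here:

```c
#include <stdio.h>

#define SECTOR_SIZE 512   /* assumed sector size, as quoted above */

/* Hypothetical helper: reads a stream sector by sector and returns the
 * number of complete sectors read, or -1 on a read error. */
long count_sectors(FILE *fp)
{
    unsigned char buffer[SECTOR_SIZE];
    long sectors = 0;

    /* One item of SECTOR_SIZE bytes per call: fread returns 1 or 0. */
    while (fread(buffer, SECTOR_SIZE, 1, fp) == 1)
        sectors++;

    /* The loop ended because a read came up short; find out why. */
    if (ferror(fp))
        return -1;        /* I/O error */
    return sectors;       /* otherwise the end-of-file state is set */
}
```

No byte in the buffer is ever compared against EOF; the end of the disk shows up only in the return value of `fread` and in the stream state reported by `feof` and `ferror`.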

You can read a sector -- 512 bytes -- and stop when fread reports the end of the disk. If this sector starts with the two signature bytes for a BMP, you can inspect it further to verify it is a BMP header, and if so, you can use the BMP structure data to check if all next sectors contain a valid BMP file. The only way to do so is:

  • the first sector contains all relevant BMP metrics: the data size field gives the size of the pixel data, and you should read that much extra data.
  • using the BMP file specifications, check if:
    • width times height times bytes per pixel (plus any scan-line padding) equals the total data size
    • data does not contain out-of-range values (not possible for 24 bit images, though)
    • data is aligned to a DWORD per scan-line
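
A hedged sketch of the header-level checks, assuming the common 40-byte BITMAPINFOHEADER layout, little-endian fields and uncompressed pixel data. The helper names (`le16`, `le32`, `bmp_plausible_size`) are hypothetical, and other BMP variants would need extra cases:

```c
#include <stdint.h>

/* Read little-endian fields from a byte buffer. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: returns the file size in bytes if the sector starts
 * with a plausible uncompressed BMP header, or 0 otherwise. */
uint32_t bmp_plausible_size(const unsigned char *sector)
{
    if (sector[0] != 'B' || sector[1] != 'M')
        return 0;

    uint32_t file_size   = le32(sector + 2);
    uint32_t data_offset = le32(sector + 10);
    uint32_t info_size   = le32(sector + 14);
    int64_t  width       = (int32_t)le32(sector + 18);
    int64_t  height      = (int32_t)le32(sector + 22);
    uint16_t planes      = le16(sector + 26);
    uint16_t bpp         = le16(sector + 28);
    uint32_t compression = le32(sector + 30);

    /* Assumption: only the 40-byte info header, uncompressed. */
    if (info_size != 40 || planes != 1 || compression != 0)
        return 0;
    if (bpp != 1 && bpp != 4 && bpp != 8 && bpp != 24 && bpp != 32)
        return 0;
    if (width <= 0 || height == 0 || data_offset < 54)
        return 0;
    if (height < 0)
        height = -height;          /* negative height means top-down rows */

    /* Each scan line is padded to a multiple of 4 bytes (a DWORD). */
    uint64_t stride   = (((uint64_t)width * bpp + 31) / 32) * 4;
    uint64_t expected = data_offset + stride * (uint64_t)height;

    /* Strict check; some writers pad the file, so a real tool might
     * accept file_size >= expected instead. */
    if (expected != file_size)
        return 0;
    return file_size;
}
```

A header that passes only shows that the metrics are consistent with each other; it says nothing about whether the following sectors really belong to the same file.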

If you accept the BMP as 'possibly correct', you can save it to disk and verify by eye if it seems correct. Then either:

  • you are 100% sure this file is well-formed; or
  • another image may start "inside" this one's data part due to fragmentation.

If it isn't a well-formed BMP image, or you want a thorough check of every sector, continue scanning with the next sector. If you are sure the image is well-formed throughout or you want to speed up scanning, you can skip (datasize+sectorsize-1)/sectorsize sectors.
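
The skip count is a rounded-up integer division. A small sketch, where `sectors_spanned` and `SECTOR_SIZE` are hypothetical names and 512-byte sectors are assumed:

```c
#include <stdint.h>

#define SECTOR_SIZE 512u   /* assumed sector size */

/* Number of sectors occupied by data_size bytes, rounded up.
 * The sum is done in 64 bits so it cannot wrap for sizes near 4 GiB. */
uint32_t sectors_spanned(uint32_t data_size)
{
    return (uint32_t)(((uint64_t)data_size + SECTOR_SIZE - 1) / SECTOR_SIZE);
}
```

If the header sector has already been read, a scanner would skip `sectors_spanned(file_size) - 1` further sectors, for example with `fseek(fp, offset, SEEK_CUR)`. Note that `fseek` takes a `long`, which may be too small for large disks on some platforms.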

The simple C program below scans an entire disk, and whenever a sector seems to indicate a BMP file start, it prints out the first 32 bytes in human-readable form. For my test disk, it gave the following output:

42 4D D8 49 EE 0E E8 B9 7A BE F3 7C DF FD 7E F7 77 9F 7B FF 38 7F F0 3C 24 33 B3 66 AD 77 BD 6B | BM.I....z..|..~.w.{.8..<$3.f.w.k
42 4D 6E E6 E3 D3 48 37 A5 27 D7 6F EF 49 4E 13 E0 A7 DF 78 47 8E 5E 3C 95 B5 0A 16 D2 5C CE 3A | BMn...H7.'.o.IN....xG.^<.....\.:
42 4D 36 00 24 00 00 00 00 00 36 00 00 00 28 00 00 00 00 04 00 00 00 03 00 00 01 00 18 00 00 00 | BM6.$.....6...(.................
42 4D 49 2C 20 62 6F 64 79 20 6D 61 73 73 20 69 6E 64 65 78 3B 20 41 53 41 2C 20 41 6D 65 72 69 | BMI, body mass index; ASA, Ameri
42 4D 50 66 6F 67 6C 65 00 00 00 00 00 00 29 1E 00 01 DC F8 BC 84 91 AE BC 84 91 AE 00 04 00 00 | BMPfogle......).................

The weird thing is, initially the disk contained no BMP files, so I copied one to test with. Now how come there is more than one candidate? (There were actually 9 more.) First, there are "false positives" -- the "BMI" one is a nice example --, but second: if there is a deleted BMP file somewhere on that disk and its first sector happens to not have been overwritten, it will also be listed!

Short & rough sample code:

#include <stdio.h>

int main (int argc, char **argv)
{
    FILE *usb_ptr;
    unsigned char buffer[512];
    int i, j;

    if (argc == 1)
    {
        printf ("wot no stick?\n");
        return -1;
    }
    usb_ptr = fopen(argv[1],"rb");
    if(usb_ptr == NULL)
    {
        printf("error opening USB Drive for reading\n");
        return -1;  /* do not fall through and read from a NULL stream */
    }

    i = 0;
    while (1)
    {
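        /* one 512-byte item per call: fread returns 0 at the end of the disk */
        if (fread (buffer, 512,1, usb_ptr) < 1)
            break;
        i++;
        /* progress indicator, updated every 128 sectors */
        if (!(i & 127))
            printf ("%d sectors read..\r", i);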
        if (buffer[0] == 'B' && buffer[1] == 'M')
        {
            for (j=0; j<32; j++)
                printf ("%02X ", buffer[j]);
            printf ("| ");
            for (j=0; j<32; j++)
            {
                if (buffer[j] >= ' ' && buffer[j] <= '~')
                    printf ("%c", buffer[j]);
                else
                    printf (".");
            }
            printf ("\n");
        }
    }

    fclose (usb_ptr);

    return 0;
}

(Afterthought) It's pretty slow for a 1 GB disk .. perhaps it's faster to read more sectors at once. (Testing..) Yup, it is way faster to read even as few as 10 sectors inside the loop.
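
A rough sketch of that batched variant, not the exact code used for the test above. The batch size of 64 sectors and the names `SECTORS_PER_READ` and `count_bm_sectors` are assumptions made for illustration:

```c
#include <stdio.h>

#define SECTOR_SIZE 512
#define SECTORS_PER_READ 64   /* assumed batch size; tune by measurement */

/* Hypothetical helper: counts the sectors that start with "BM",
 * reading several sectors per fread call. */
long count_bm_sectors(FILE *fp)
{
    static unsigned char buffer[SECTOR_SIZE * SECTORS_PER_READ];
    long hits = 0;
    size_t got, s;

    /* With an item size of one sector, fread returns the number of
     * complete sectors read, which may be fewer than requested at the
     * end of the disk and is 0 once nothing is left. */
    while ((got = fread(buffer, SECTOR_SIZE, SECTORS_PER_READ, fp)) > 0) {
        for (s = 0; s < got; s++) {
            const unsigned char *sec = buffer + s * SECTOR_SIZE;
            if (sec[0] == 'B' && sec[1] == 'M')
                hits++;
        }
    }
    return hits;
}
```

Keeping the item size at one sector means a trailing partial sector is never inspected, which matches the one-sector-at-a-time loop.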

Jongware
  • that's some excellent stuff there,thanks for taking the time @jongware, I was on google trying to figure out how to combine fragmented files. seems you are right there again, the complexity is over the top!!, I didnt think of the deleted BMP file aswell..this is getting rather complicated/interesting for my weekend project..gonna soak in all you said and see what I can come up with. – tesseract Dec 29 '13 at 18:47
  • It worked only for files (in usb drive). When I select Drive I'm getting fopen null – Developer2012 Mar 13 '21 at 18:25