Search for file in archive and load it into memory

Question

Basically I need to load a file within an archive into memory, but since the user is able to modify the contents of the archive it is very likely that the file offset will change.

So I need to create a function that searches the archive for a file with the help of a hex pattern, returns the file offset, loads the file into memory and returns the file address.

To load a file into memory and return the address I currently use this:

DWORD LoadBinary(char* filePath)
{
    FILE *file = fopen(filePath, "rb");
    long fileStart = ftell(file);
    fseek(file, 0, SEEK_END);
    long fileSize = ftell(file);
    fseek(file, fileStart, 0);
    BYTE *fileBuffer = new BYTE[fileSize];
    fread(fileBuffer, fileSize, 1, file);
    LPVOID newmem = VirtualAlloc(NULL, fileSize, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    memcpy(newmem, fileBuffer, fileSize);
    delete[]fileBuffer;
    fclose(file);
    return (DWORD)newmem;
}

The archive is neither encrypted nor compressed, but it is pretty big (about 1 GB) and I'd like to not load the entire file into memory if possible.

I'm aware of the size of the file I'm looking for inside the archive so I don't need the function to find the end of the file with another pattern.

File Pattern: "\x30\x00\x00\x00\xA0\x10\x04\x00"

File Length: 4096 bytes

How can I implement this, and which functions do I need?

Solution

The code is probably slow for large files, but this works for me since the file I'm looking for is at the beginning of the archive.

FILE *file = fopen("C:/data.bin", "rb");
if (file != NULL)
{
    BYTE window[4] = { 0 }; // the last four bytes read, oldest first
    long offset = -1;       // stays -1 if the pattern is never found
    long count = 0;         // bytes read so far
    int input;

    while ((input = fgetc(file)) != EOF)
    {
        // Slide the window one byte to the left and append the new byte.
        window[0] = window[1];
        window[1] = window[2];
        window[2] = window[3];
        window[3] = (BYTE)input;
        count++;

        if (count >= 4 && window[0] == 0xDE && window[1] == 0xAD &&
            window[2] == 0xBE && window[3] == 0xEF)
        {
            offset = count - 4; // start of the four matching bytes
            printf("Match @ 0x%08lX\n", (unsigned long)offset);
            break;
        }
    }
    fclose(file);
}

How can the file be _pretty big (about 1gb)_ and at the same time have a _File length: 4096 bytes_? And by the way, it's not 1**gb**, it's 1**GB**: a lowercase **b** represents bits, while an uppercase **B** represents bytes, and giga is always written with an uppercase **G**. — BufferOverflow, Jun 12 '15 at 11:46
The size of the archive is roughly 1 GB, the size of the file within the archive is 4096 bytes. Will edit the first post. — Jugyou, Jun 12 '15 at 11:49

score 1 · Accepted Answer · edited May 23 '17 at 11:51

The principle is stated in this answer: you need a finite state machine (FSM) that takes the file's bytes one by one as input and compares each byte with the pattern byte selected by the FSM state, which is an index into the pattern.

Here is the simplest, but naive solution template:

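FILE *file = fopen(path, "rb");
size_t state = 0; // index of the next pattern byte to match
for (int input_result; (input_result = fgetc(file)) != EOF;) {
    char input = (char)input_result;
    if (input == pattern[state]) {
        ++state;
    } else {
        // Naive restart: the mismatching byte may itself begin a new match.
        // This is only fully correct if pattern[0] does not recur in the pattern.
        state = (input == pattern[0]) ? 1 : 0;
    }
    if (state == pattern_size) {
        // Pattern is found at (ftell(file) - pattern_size).
        break;
    }
}
fclose(file);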

The state variable holds the current position in the pattern; it is the state of the FSM.

While this solution satisfies your needs, it is not optimal, because reading a byte from a file takes nearly the same time as reading a bigger block of, say, 512 bytes or even more. You can improve this yourself in two steps:

1. On each iteration, read a block instead of a single character. Use fread(). Note that calculating the pattern location (once it is found) becomes a bit more complicated, because ftell() no longer matches the input location.
2. Add an inner loop to iterate through the block you've just read. Handle the input characters the same way as before; this is where the FSM approach proves useful.
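
The two steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the names find_pattern and load_range are hypothetical, the 4096-byte block size is an arbitrary choice, and malloc() stands in for VirtualAlloc() to keep the example portable. The restart on a mismatch is still the naive one, so it assumes the first pattern byte does not recur later in the pattern (true for the pattern in the question). The second helper covers the loading half of the question with fseek() and fread().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns the offset of the first occurrence of
   pattern in the file, or -1 if it is absent or the file cannot be opened. */
long find_pattern(const char *path, const unsigned char *pattern, size_t pattern_size)
{
    unsigned char block[4096];  /* block size is an arbitrary choice */
    size_t state = 0;           /* FSM state: index into the pattern */
    long block_start = 0;       /* file offset of block[0] */
    long found = -1;
    size_t n;
    FILE *file;

    if (pattern_size == 0) return -1;
    file = fopen(path, "rb");
    if (file == NULL) return -1;

    /* Outer loop: one fread() per block. The state survives across
       blocks, so a pattern that straddles two blocks is still found. */
    while (found < 0 && (n = fread(block, 1, sizeof block, file)) > 0) {
        for (size_t i = 0; i < n; ++i) {
            if (block[i] == pattern[state]) {
                ++state;
            } else {
                /* Naive restart; assumes pattern[0] does not recur
                   later in the pattern. */
                state = (block[i] == pattern[0]) ? 1 : 0;
            }
            if (state == pattern_size) {
                /* The match ends at file offset block_start + i. */
                found = block_start + (long)i + 1 - (long)pattern_size;
                break;
            }
        }
        block_start += (long)n;
    }
    fclose(file);
    return found;
}

/* Hypothetical helper: reads size bytes starting at offset into a
   malloc'd buffer that the caller must free. Returns NULL on failure,
   including when fewer than size bytes are available. */
unsigned char *load_range(const char *path, long offset, size_t size)
{
    unsigned char *buffer;
    FILE *file = fopen(path, "rb");
    if (file == NULL) return NULL;

    buffer = (unsigned char *)malloc(size);
    if (buffer != NULL &&
        (fseek(file, offset, SEEK_SET) != 0 ||
         fread(buffer, 1, size, file) != size)) {
        free(buffer);
        buffer = NULL;
    }
    fclose(file);
    return buffer;
}
```

With the values from the question, a call such as find_pattern("C:/data.bin", pattern, 8) followed by load_range("C:/data.bin", offset, 4096) would locate and load the embedded file, provided the first call did not return -1. Because offsets are held in a long, this sketch only suits archives under 2 GB on platforms where long is 32 bits wide.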

Thank you for your answer. I'm sure I understand how the code is supposed to work, but I couldn't implement it correctly. In the end I came up with something that is very likely to be slower, but works for me since the file I'm looking for is at the beginning of the archive. So the search only takes about 3 seconds. I will accept your answer because it steered me into the right direction. Again, thank you very much. — Jugyou, Jun 14 '15 at 12:08
