
I'm writing an application that deals with very large user-generated input files. The program will copy about 95 percent of the file, effectively duplicating it while switching a few words and values in the copy, and then append the copy (in chunks) to the original file, such that each block (consisting of between 10 and 50 lines) in the original is followed by the copied and modified block, then the next original block, and so on. The user-generated input conforms to a certain format, and it is highly unlikely that any line in the original file is longer than 100 characters.

Which would be the better approach?

  1. To use one file pointer, with variables that track how much has been read and where to write, seeking the file pointer back and forth between reads and writes; or

  2. To use multiple file pointers, one for reading and one for writing.

I am mostly concerned with the efficiency of the program, as the input files will reach up to 25,000 lines, each about 50 characters long.

tshepang
    25,000 * 51 = 1,275,000 isn't large :) you can create the new part of the file in an in-memory buffer and write it out with a single `(f)write` call. – chill Nov 21 '12 at 17:02

2 Answers


If you have memory constraints, or you want a generic approach, read bytes into a buffer from one file pointer, make changes, and write the buffer out to a second file pointer when the buffer is full. If you reach EOF on the first pointer, make your changes and flush whatever remains in the buffer to the output pointer. If you intend to replace the original file, copy the output file over the input file and then remove the output file. This "atomic" approach lets you verify that the copy operation completed correctly before deleting anything.

For example, to deal with generically copying over any number of bytes, say, 1 MiB at a time:

/* requires <stdio.h>, <stdlib.h>, and <stdint.h> */

#define COPY_BUFFER_MAXSIZE 1048576  /* 1 MiB */

/* ... */

unsigned char *buffer = malloc(COPY_BUFFER_MAXSIZE);
if (!buffer)
    exit(EXIT_FAILURE);

/* open in binary mode so ftell() reliably reports a byte count */
FILE *inFp = fopen(inFilename, "rb");
if (!inFp)
    exit(EXIT_FAILURE);
fseek(inFp, 0, SEEK_END);
uint64_t fileSize = (uint64_t) ftell(inFp);
rewind(inFp);

FILE *outFp = stdout; /* change this if you don't want to write to standard output */

uint64_t outFileSizeCounter = fileSize;

/* we fread() bytes from inFp in COPY_BUFFER_MAXSIZE increments, until there is nothing left to fread() */

do {
    size_t chunk = (outFileSizeCounter > COPY_BUFFER_MAXSIZE)
                       ? (size_t) COPY_BUFFER_MAXSIZE
                       : (size_t) outFileSizeCounter;
    if (fread(buffer, 1, chunk, inFp) != chunk)
        break; /* short read: I/O error or unexpected EOF */
    /* -- make changes to buffer contents at this stage
       -- if you resize the buffer, make a copy of it and
          fwrite() the number of bytes in the copy instead */
    fwrite(buffer, 1, chunk, outFp);
    outFileSizeCounter -= chunk;
} while (outFileSizeCounter > 0);

fclose(inFp);
free(buffer);

An efficient way to deal with a resized buffer is to keep a second pointer, say, unsigned char *copyBuffer, which is realloc()-ed to twice the size, if necessary, to deal with accumulated edits. That way, you keep expensive realloc() calls to a minimum.

Not sure why this got downvoted, but it's a pretty solid approach for doing things with a generic amount of data. Hope this helps someone who comes across this question, in any case.

Alex Reynolds

25,000 lines * 100 characters = 2.5 MB; that's not really a huge file. The fastest approach will probably be to read the whole file into memory, write your results to a new file, and replace the original with it.
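A minimal sketch of that whole-file approach, assuming the caller supplies the input and temporary filenames (the `rewrite_file` name is hypothetical):

```c
#include <stdio.h>
#include <stdlib.h>

/* Slurp the input file, transform it in memory, write the result to a
   temporary file, then rename() the temporary over the original.
   Returns 0 on success, -1 on any failure. */
int rewrite_file(const char *inName, const char *tmpName) {
    FILE *in = fopen(inName, "rb");
    if (!in)
        return -1;
    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);
    char *data = malloc((size_t)size);
    if (!data || fread(data, 1, (size_t)size, in) != (size_t)size) {
        free(data);
        fclose(in);
        return -1;
    }
    fclose(in);

    /* ... transform `data` in memory here ... */

    FILE *out = fopen(tmpName, "wb");
    if (!out) {
        free(data);
        return -1;
    }
    size_t written = fwrite(data, 1, (size_t)size, out);
    free(data);
    if (fclose(out) != 0 || written != (size_t)size)
        return -1;

    /* atomic on POSIX: the original is only replaced on success */
    return rename(tmpName, inName);
}
```

Because the rename happens last, a failure at any earlier step leaves the original file untouched.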

Stefan Friesel