3

I don't like to just dump a load of code in here and ask people to debug it for me, but I'm a bit inexperienced with C and I'm totally stumped.

The overall aim is to do a bit of cleanup on a very large log file (11G+), I'm reading in 2048 bytes at a time, then scanning individual lines, writing them to an output file. I was originally using strstr to find line endings, however I found this didn't work with the partial line at the end of the read buffer - I think this is because the "string" I'm reading from the file doesn't have a \0 at the end of it, and strstr gets confused.

So, after a bit of googling I thought I'd try memmem, which seems to be a "binary safe" drop-in replacement for strstr. This is where I'm stuck, my program is segfaulting during the call to memmem.

#include <stdio.h>
#include <string.h>

#define BUFF_LEN 2048

int main (void)
{
    char file_buff[BUFF_LEN], prev_line[BUFF_LEN], curr_line[BUFF_LEN];
    char *p_line_start, *p_lf;
    int bytes_consumed, bytes_read;
    FILE *in_fp, *out_fp;

    in_fp = fopen("208.log", "r");
    out_fp = fopen("expanded.log", "w+");

    int sane = 0;
    while (1) {
        bytes_read = fread(file_buff, 1, BUFF_LEN, in_fp);
        if (bytes_read == 0) {
            break;
        }

        // Set the pointer to the beginning of the file buffer
        p_line_start = file_buff;
        bytes_consumed = 0;

        // Chomp lines
        while (bytes_consumed < bytes_read) {
            printf("Read to go with bytes_read = %d, bytes_consumed = %d\n",
                bytes_read, bytes_consumed);
            p_lf = (char *) memmem(p_line_start, bytes_read - bytes_consumed,
                "\n", 1);
            if (p_lf == NULL) {
                // No newline left in file_buff, store what's left in
                // curr_line and break out to read more from the file.
                printf("At loop exit I have chomped %ld of %d\n",
                    p_line_start - file_buff, bytes_read);
                //break;
                goto cleanup;
            }
            // Copy the line to our current line buffer (including the newline)
            memcpy(curr_line, p_line_start, p_lf - p_line_start + 1);
            printf("Chomped a line of length %ld\n", p_lf - p_line_start + 1);
            fwrite(curr_line, 1, p_lf - p_line_start + 1, out_fp);
            p_line_start = p_lf + 1;
            bytes_consumed += p_lf - p_line_start + 1;
        }

Can anyone throw me a line here?!
Tips on how to better debug this for myself are also welcome.

Deduplicator
  • 44,692
  • 7
  • 66
  • 118
Robin
  • 4,242
  • 1
  • 20
  • 20
  • If you can use `mmap`, you may want to use that to map the whole file into memory. Then, you don't have to worry about partial files. (`mmap` lazy-loads memory pages, so you also don't have to worry about blowing your RAM on an 11GB file). – nneonneo Sep 15 '14 at 20:21
  • Have you tried running this in a debugger? Then you can see exactly where it crashes, and you can look at the values of the variables. – Thomas Padron-McCarthy Sep 15 '14 at 20:22
  • **The last two lines you show seem to be swapped**. Why are you casting the return-value of `memmem`? See [Don't cast the result of malloc (and friends)](http://stackoverflow.com/q/605845). Also, why not `memchr`? BTW: `for(;;)` is more idiomatic than `while(1)`. I guess you already are using `-Wall -Wextra -pedantic -std=c11`... – Deduplicator Sep 15 '14 at 20:34
  • 1
    You're overcomplicating things by using `fread` instead of `fgets` to read the file. Of course, you may still need to deal with line truncation when using `fgets`. How long is the longest line in the input file? Or is that an unknown? – user3386109 Sep 15 '14 at 20:42
  • @nneonneo - thanks for the top - I'll look in to that. – Robin Sep 15 '14 at 20:55
  • @Thomas Padron-McCarthy - yeah, I ran it through gdb but the crash is inside the memmem function. – Robin Sep 15 '14 at 20:56
  • @Deduplicator - I'm casting the return value because gcc was kicking out warnings: "warning: assignment makes pointer from integer without a cast". Thanks for the other tips, the additional GCC args spat out a " warning: implicit declaration of function ‘memmem’" (which seems weird) – Robin Sep 15 '14 at 20:59
  • @user3386109 - I don't know how long the longest line will be. I was hoping that by reading in big chunks of file like this the program would be faster than reading a line at a time. – Robin Sep 15 '14 at 21:01
  • Nothing weird about it, you didn't define `_GNU_SOURCE` prior to including ``: http://man7.org/linux/man-pages/man3/memmem.3.html Another reason to use [`memchr`](http://man7.org/linux/man-pages/man3/memchr.3.html) instead. – Deduplicator Sep 15 '14 at 21:02
  • OK, memchr looks promising, I'll give that a go tomorrow night. Thanks! – Robin Sep 15 '14 at 21:09
  • 1
    I expect that the file I/O driver is reading the file in big chunks anyways, so it's doubtful that you'll see a major performance difference between `fread` and `fgets`. As for the longest line, it's not unreasonable to have a line buffer of 5MB, e.g. `char *buffer = malloc(5000000); while (fread(buffer,5000000,fp_in)!=NULL){...}` So I guess my question is, do you expect that the log file will have lines that are longer than 5MB? – user3386109 Sep 15 '14 at 21:10
  • You could use `strstr` just fine if you'd reserve an extra byte at the end of your buffer and set its value to 0. Then if `strstr` looking for a newline returned NULL, you'd know you were at the partial line at the end and could copy it into the beginning of the buffer and read another chunk, or however you wanted to handle partial lines. Assuming your log file is really just text. – indiv Sep 15 '14 at 21:16
  • 1
    @indiv: that is possible, and it can work with a buffer at least as long as the search string (possibly shorter -- my mind shrinks back from working that one out, though), but the book-keeping overhead is huge, easy to get wrong, and hard to debug. I'd go for a large one-line input buffer as well. And then double its size, just for safety. – Jongware Sep 15 '14 at 21:35

1 Answers1

2

From one of your comments:

I'm casting the return value because gcc was kicking out warnings: "warning: assignment makes pointer from integer without a cast".

You are just hiding the problem by casting the return value.

memmem returns a pointer. Typically today a pointer is 64 bits. If you haven't declared the function, the compiler doesn't know it returns a pointer, and instead assumes it returns an integer. Typically today an integer is 32 bits. The generated code will look in the place where that integer would have been returned, and take 32 bits from there. What it will actually get is half the returned pointer.

Try to add this line just after your call to memmem, and see if the printouts differ if you declare or don't declare memmem:

printf("[p_lf = %p]\n", (void*)p_lf);

When I ran it, with your original program (without a declaration), it printed 0xffffffffffffda67, and then crashed, because that was an invalid pointer. With a declaration (using #define _GNU_SOURCE) it printed 0x7fffffffda67, and didn't crash. Notice that if you take only the 32 lower bits of 0x7fffffffda67 you get 0xffffda67, and if you then expand it to 64 bits, you get 0xffffffffffffda67, the pointer from your original program. (Address space layout randomization turned off.)

That's why you shouldn't cast return values.

Thomas Padron-McCarthy
  • 27,232
  • 8
  • 51
  • 75
  • The funny thing is that *man memmem* instructs to define `_GNU_SOURCE`, but in fact that function is only declared if `__USE_GNU` is defined. Symbol highlighting and popup syntax hints in IDEs make this even more confusing by giving the user false sense of that everything is fine. – Anton Samsonov Jun 27 '16 at 11:25