So, I'm writing this simple HTTP client in C and I seem to be stuck on this problem: how do I strip the HTTP headers from the response? After all, if I get a binary file I can't just write the headers out to my output file. I can't seem to go in once the data is already written to a file, because Linux screams when you try to even view the first few lines of a binary file, even if you know they're just text HTTP headers.

Now, here's the rub (well, I suppose the whole thing is a rub). Sometimes the whole header doesn't even come in on the first response packet, so I can't guarantee that we'll have the whole header in our first iteration (that is, the first iteration of receiving the HTTP response; we're using recv() here), which means I need to somehow... well, I don't even know. I can't seem to mess with the data once it's already written to disk, so I need to deal with it as it's coming in, but we can't be sure how it's going to come in, and even if we were sure, strtok() is a nightmare to use.

I guess I'm just hoping someone out there has a better idea. Here's the relevant code. This is really stripped down; I'm going for an MCVE, of course. Also, you can just assume that socket_file_descriptor is already set up and get_request contains the text of our GET request. Here it is:

FILE* fp = fopen("output", "wb"); // Open the file for writing
char buf[MAXDATASIZE]; // The buffer
ssize_t numbytes; // Bytes received; recv() returns ssize_t (-1 on error), which a size_t would turn into a huge positive value

/*
 * Do all the socket programming stuff to get the socket file descriptor that we need
 * ...
 * ...
*/

send(socket_file_descriptor, get_request, strlen(get_request), 0); // Send the HTTP GET request

while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
    /* I either need to do something here, to deal with getting rid of the headers before writing to file */
    fwrite(buf, 1, numbytes, fp); // Write to file
    memset(buf, 0, MAXDATASIZE); // Optional: recv() overwrites the buffer, and numbytes says how many bytes are valid
}
close(socket_file_descriptor);
fclose(fp);
/* Or I need to do something here, to strip the file of its headers after it's been written to disk */

So, I thought about doing something like this. The only thing we know for sure is that the header is going to end in \r\n\r\n (two CRLF pairs in a row, i.e. an empty line). So we can use that. This doesn't really work, but hopefully you can figure out where I'm trying to go with it (comments from above removed):

FILE* fp = fopen("output", "wb");
char buf[MAXDATASIZE];
ssize_t numbytes; // recv() returns ssize_t (-1 on error)
int header_found = 0; // Add a flag, here

/* ...
 * ...
*/

send(socket_file_descriptor, get_request, strlen(get_request), 0);

while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
    if (header_found == 1) { // So this won't happen our first pass through
        fwrite(buf, 1, numbytes, fp);
        memset(buf, 0, MAXDATASIZE);
    }
    else { // This will happen our first pass through, maybe our second or third, the header doesn't always come in in full on the first packet
        /* And this is where I'm stuck.
         * I'm thinking about using strtok() to parse through the lines, but....
         * well I just can't figure it out. I'm hoping someone can at least point
         * me in the right direction.
         *
         * The point here would be to somehow determine when we've seen two carriage returns
         * in a row and then mark header_found as 1. But even if we DID manage to find the
         * two carriage returns, we still need to write the remaining data from this packet to 
         * the file before moving on to the next iteration, but WITHOUT including the
         * header information.
        */
    }
}
close(socket_file_descriptor);
fclose(fp);

I've been staring at this code for three days straight and am slowly losing my mind, so I really appreciate any insight anyone is able to provide. To generalize the problem, I guess this really comes down to me just not understanding how to do text parsing in C.

  • Is this "How do I write an HTTP parser in C?" If so, that's a *lot* to work through in one question. If you've never written a parser before, start with something simple, like a line-delimited parser that splits into lines. From there, parse headers by correctly splitting the header name from header value. – tadman Sep 07 '20 at 20:40
  • Hmmm can you point me to any resources on that? I'm really not too experienced with C. Right now my solution (we'll see if I'm able to implement it) is go through byte by byte and find 13 10 13 10 (`\r \n \r \n`), and then somehow just write out the rest of the current thing from there. – ALittleHelpFromMyFriends Sep 07 '20 at 20:59
  • But yes, I think this question really does just come down to "How do I write an HTTP parser in C." Well, not even, I don't care about the information in the header, I just want to lop it off and only take the message body. Couldn't care less what the headers actually say. It's really just "how do I parse this buffer." – ALittleHelpFromMyFriends Sep 07 '20 at 21:00
  • In C this generally plays out as simple state machines, where you loop over the content and branch to different states based on character matches inside a `switch` statement. There are innumerable HTTP parsers out there, many in C, which are open-source and easily obtained for inspiration. There's surely also dozens of well-written examples that walk you through this. – tadman Sep 07 '20 at 21:00
  • If you're just looking for the end of the headers, use `strstr` to look for `CRLFCRLF` and there's your data. The headers terminate with that sequence. – tadman Sep 07 '20 at 21:01
  • Thanks, @tadman! I ended up going about it another way (see below), I'm sure your way makes way more sense tho. – ALittleHelpFromMyFriends Sep 07 '20 at 21:40
  • @ALittleHelpFromMyFriends See [my answer](https://stackoverflow.com/a/16247097/65863) to [Differ between header and content of http server response (sockets)](https://stackoverflow.com/questions/16243118/) – Remy Lebeau Sep 08 '20 at 17:09
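
The "search for CRLFCRLF" idea from the comments can be sketched roughly as follows. Two choices here are assumptions of this sketch rather than anything tadman specified: the header bytes are accumulated across recv() calls (the terminator can straddle two packets), and the search is length-bounded with memcmp instead of `strstr`, because a receive buffer is not NUL-terminated and a binary body may contain zero bytes. The names `find_header_end`, `header_acc`, `header_acc_feed` and the `HEADER_MAX` cap are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): returns the offset of the first
 * body byte, i.e. just past "\r\n\r\n", or -1 if data[0..len) does not yet
 * contain the terminator. Lengths are explicit, so zero bytes are harmless. */
static long find_header_end(const char *data, size_t len)
{
    assert(data != NULL);
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(data + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4);
    }
    return -1;
}

#define HEADER_MAX 8192 /* assumed cap on the total header size */

/* Hypothetical accumulator for header bytes that arrive over several recv() calls. */
struct header_acc {
    char data[HEADER_MAX];
    size_t len;
};

/* Append one received chunk and search again.
 * Returns the body offset inside acc->data once the terminator is complete,
 * -1 if more data is needed, -2 if the headers would exceed HEADER_MAX. */
static long header_acc_feed(struct header_acc *acc, const char *chunk, size_t n)
{
    assert(acc != NULL && chunk != NULL);
    if (n > HEADER_MAX - acc->len)
        return -2;
    memcpy(acc->data + acc->len, chunk, n);
    acc->len += n;
    return find_header_end(acc->data, acc->len);
}
```

In the recv() loop, one would call `header_acc_feed(&acc, buf, numbytes)` until it returns an offset `off >= 0`, write `acc.data + off` (that is, `acc.len - off` bytes) to the file, and from then on write every later chunk as-is. Re-scanning the accumulated bytes each time is wasteful but simple; headers are normally small.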

3 Answers

The second self-answer is better than the first one, but it still could be made much simpler:

const char* pattern = "\r\n\r\n";
const char* patp = pattern;
while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
    for (int i = 0; i < numbytes; i++) {
        if (*patp == 0) {
            fwrite(buf + i, 1, numbytes - i, fp);
            break;
        }
        else if (buf[i] == *patp) ++patp;
        else patp = pattern;
    }
    /* This memset isn't really necessary */
    memset(buf, 0, MAXDATASIZE);
}

That looks like a general solution, but it's not really: there are values for pattern for which it might fail to see a terminator under particular circumstances. This particular pattern is not problematic as long as every \r in the headers is immediately followed by \n, which holds for well-formed HTTP responses. You might want to think about what sort of pattern would cause a problem before taking a look at the more general solution.
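
The more general solution is not reproduced here. One common way to make this kind of matcher work for any pattern is a Knuth-Morris-Pratt style failure table, sketched below; this is an assumption about what "general" could look like, not necessarily the solution the answer had in mind. The names `matcher`, `matcher_init`, `matcher_feed` and the `PAT_MAX` limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAT_MAX 32 /* assumed upper bound on the pattern length */

/* Hypothetical streaming matcher; its state survives across chunks. */
struct matcher {
    const char *pat;
    size_t len;
    size_t fail[PAT_MAX]; /* KMP failure table */
    size_t matched;       /* number of pattern bytes matched so far */
};

static void matcher_init(struct matcher *m, const char *pat)
{
    assert(m != NULL && pat != NULL);
    m->pat = pat;
    m->len = strlen(pat);
    assert(m->len > 0 && m->len <= PAT_MAX);
    m->matched = 0;
    m->fail[0] = 0;
    /* fail[i] = length of the longest proper prefix of pat[0..i]
     * that is also a suffix of it */
    for (size_t i = 1, k = 0; i < m->len; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = m->fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        m->fail[i] = k;
    }
}

/* Feed one chunk. Returns the index in buf just past the end of the pattern,
 * or -1 if the pattern is not complete yet. Do not call again after a match. */
static long matcher_feed(struct matcher *m, const char *buf, size_t n)
{
    assert(m != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++) {
        /* On a mismatch, fall back to the longest prefix that still fits
         * instead of restarting from zero. */
        while (m->matched > 0 && buf[i] != m->pat[m->matched])
            m->matched = m->fail[m->matched - 1];
        if (buf[i] == m->pat[m->matched])
            m->matched++;
        if (m->matched == m->len)
            return (long)(i + 1);
    }
    return -1;
}
```

With the pattern "aab" and the input "aaab", a matcher that simply resets on mismatch misses the match, while this one falls back to the one-character prefix and finds it. In the recv() loop, once `matcher_feed` returns `off >= 0`, the body starts at `buf + off` in that chunk.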

rici
  • Can you explain what’s going on in there a little bit? I’m still tryna wrap my head around pointers.... under which situation would patp be set to zero? – ALittleHelpFromMyFriends Sep 09 '20 at 13:11
  • @ALittleHelpFromMyFriends: when it points at the NUL terminator at the end of `pattern`. (`*patp == 0` doesn't test if `patp` *is* 0. It tests whether it *points to* a 0. Those are very different tests.) – rici Sep 09 '20 at 13:18
  • So in this context `0` and the string terminating byte `\0` are the same? – ALittleHelpFromMyFriends Sep 09 '20 at 14:31
  • `\0` means "the character whose code is 0". So, yes. And not just in this context. C character literals have type `int` – rici Sep 09 '20 at 14:37
  • Hm. I'm not quite sure I get it, but I implemented it and it worked, so, thank you!! – ALittleHelpFromMyFriends Sep 09 '20 at 16:40
  • @ALittleHelpFromMyFriends: All it is doing is using the termination pattern to contain the individual characters, so that instead of hardcoding the characters into the code (which is complicated), it just steps through the pattern. In effect, the pointer `patp` could be your counter (and I could have implemented it as a counter, but why?). Doing it this way means that I can use exactly the same code with a different terminator sequence, without even having to know how long the terminator sequence is. (But watch out for the note at the end of the answer: not *every* pattern works.) – rici Sep 09 '20 at 16:55
  • @ALittleHelpFromMyFriends: If you're going to use C for string functions, you need to have a clear idea what a pointer is, what a null-terminator is, and how they work together to avoid having to constantly count the length of a string. None of it is complicated. It's just how you look at the problem. – rici Sep 09 '20 at 16:56
  • For example, why did I just drop your boolean `header_found`? Because the test is so simple (`*patp == 0`) that nothing is saved by caching a boolean value. Without that boolean, you would just end up doing the same test twice, once before the loop and once at the beginning of the loop. That's pointless, so I could just eliminate the redundant test along with the boolean. – rici Sep 09 '20 at 16:59
  • I do see why you dropped the `header_found` boolean, that does make sense - once `*patp == 0` it'll stay like that, so subsequent iterations of the greater while loop will also enter the for loop with `i = 0`, and break out of the for loop from there. Let me see if I got this. `*patp` begins by equaling pattern, `\r\n\r\n`. Or, rather, `patp` begins by pointing to the beginning of pattern? And each time you find a character that matches the current first character (`buf[i] == *patp`), you advance `patp` to its next character, until you get to the NUL character `\0` ... something like that. – ALittleHelpFromMyFriends Sep 09 '20 at 19:34
  • @ALittleHelpFromMyFriends: Yes, exactly. `patp - pattern` is always exactly the same as the value of your `consec_success`. But I don't have to use any logic to figure out which character it corresponds to; I just dereference it. – rici Sep 09 '20 at 21:20
  • Well that's very helpful, thank you so much for taking the time to go over it with me! I really appreciate it. – ALittleHelpFromMyFriends Sep 09 '20 at 22:47
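
The pointer points from this comment thread can be condensed into a small sketch. It pulls the per-byte logic of the answer's loop into helper functions so each test is visible on its own; the helper names `step`, `matched_count` and `pattern_done` are hypothetical and only exist for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Consume one input byte and return the updated pattern pointer.
 * Mirrors the if / else-if / else chain in the answer's loop. */
static const char *step(const char *pattern, const char *patp, char c)
{
    assert(pattern != NULL && patp != NULL);
    if (*patp == 0)   /* patp points AT the NUL terminator: pattern complete */
        return patp;
    if (c == *patp)   /* byte equals the pattern character patp points at */
        return patp + 1;
    return pattern;   /* mismatch: start over at the first pattern character */
}

/* Pattern characters matched so far: the same information a counter holds. */
static size_t matched_count(const char *pattern, const char *patp)
{
    return (size_t)(patp - pattern);
}

/* True once patp points at the terminating '\0'. The pointer itself is never
 * NULL here; only the byte it points to is zero. '\0' and 0 are the same value. */
static int pattern_done(const char *patp)
{
    return *patp == '\0';
}
```

Feeding the bytes of "x\r\n\r\n" through `step` with the pattern "\r\n\r\n" leaves `patp` four characters past `pattern`, at which point `pattern_done(patp)` is true while `patp` itself is still a valid, non-null pointer.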

So, I know this is not the most elegant way to go about this, but... I did get it. For anyone who finds this question and is curious about at least an answer, here it is:

int count = 0;
int firstr_found = 0;
int firstn_found = 0;
int secondr_found = 0;
int secondn_found = 0;
FILE* fp = fopen("output", "wb");
char buf[MAXDATASIZE];
ssize_t numbytes; // recv() returns ssize_t (-1 on error)
int header_found = 0;

/* ...
 * ...
*/

send(socket_file_descriptor, get_request, strlen(get_request), 0);

while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
    if (header_found == 1) {
        fwrite(buf, 1, numbytes, fp);
    }
    else {
        // These buf[i]'s are going to return as integers (ASCII)
        // \r is 13 and \n is 10, so we're looking for 13 10 13 10
        // This also needs to be agnostic of which packet # we're on; sometimes the header is split up.
        for (int i = 0; i < numbytes; i++) {
            if (firstr_found == 1 && firstn_found == 1 && secondr_found == 1 && secondn_found == 1) { // WE FOUND IT!
                header_found = 1;
                // We want to skip the parts of the buffer we've already looked at, that's header, and our numbytes will be decreased by that many
                fwrite(buf + i, 1, numbytes - i, fp);
                break;
            }
            
            if (buf[i] == 13 && firstr_found == 0) { // We found our first \r, mark it and move on to next iteration
                firstr_found = 1;
                continue; 
            }
            if (buf[i] == 10 && firstr_found == 1 && firstn_found == 0) { // We found our first \n, mark it and move on
                firstn_found = 1;
                continue; 
            }
            else if (buf[i] != 13 && buf[i] != 10) { // Think about the second r, it'll ignore the first if, but fail on the second if, but we don't want to jump into this else block
                firstr_found = 0;
                firstn_found = 0;
                continue;
            }
            if (buf[i] == 13 && firstr_found == 1 && firstn_found == 1 && secondr_found == 0) {
                secondr_found = 1;
                continue;
            }
            else if (buf[i] != 10) {
                firstr_found = 0;
                firstn_found = 0;
                secondr_found = 0;
                continue;
            }
            if(buf[i] ==  10 && firstr_found == 1 && firstn_found == 1 && secondr_found == 1 && secondn_found == 0) {
                secondn_found = 1;
                continue;
            }
        }
    }
    memset(buf, 0, MAXDATASIZE);
    count++;
}
close(socket_file_descriptor);
fclose(fp);

Adding another answer because, well I suppose I think I'm clever. Thanks to @tadman for the idea of a counter. Look here (I'm going to shave off a lot of the bloat and just do the while loop, if you've looked at my other code blocks you should be able to see what I mean here) ...

/* ...
 * ...
*/
int consec_success = 0;
while ((numbytes = recv(socket_file_descriptor, buf, MAXDATASIZE - 1, 0)) > 0) {
    if (header_found == 1) {
        fwrite(buf, 1, numbytes, fp);
    }
    else {
        for (int i = 0; i < numbytes; i++) {
            if (consec_success == 4) {
                header_found = 1;
                fwrite(buf + i, 1, numbytes - i, fp);
                break;
            }
            
            if (buf[i] == 13 && consec_success % 2 == 0) {
                consec_success++;
            }
            else if (buf[i] == 10 && consec_success % 2 == 1) {
                consec_success++;
            }
            else {
                consec_success = 0;
            }
        }
    }
    memset(buf, 0, MAXDATASIZE);
}
/* ...
 * ...
*/
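
The counter above is effectively a small state machine, which is the shape tadman's comment describes. Here is a minimal sketch of the same idea with explicit states and a `switch`. One deliberate difference from the counter version: a stray \r that breaks a partial match is treated as a possible new start instead of resetting to zero. The names `hdr_state`, `hdr_step` and `hdr_scan` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* How much of "\r\n\r\n" has been seen so far. */
enum hdr_state { S_START, S_CR, S_CRLF, S_CRLFCR, S_DONE };

/* Advance the state machine by one byte. */
static enum hdr_state hdr_step(enum hdr_state s, char c)
{
    switch (s) {
    case S_START:  return c == '\r' ? S_CR : S_START;
    case S_CR:     return c == '\n' ? S_CRLF : (c == '\r' ? S_CR : S_START);
    case S_CRLF:   return c == '\r' ? S_CRLFCR : S_START;
    case S_CRLFCR: return c == '\n' ? S_DONE : (c == '\r' ? S_CR : S_START);
    case S_DONE:   return S_DONE;
    }
    return S_START; /* not reached; keeps compilers quiet */
}

/* Scan one received chunk while the header is still being read.
 * Returns the index of the first body byte in buf if the header ends in
 * this chunk (possibly n, when the terminator is the chunk's last byte),
 * or -1 if the header continues in the next chunk.
 * Only call this while *s != S_DONE. */
static long hdr_scan(enum hdr_state *s, const char *buf, size_t n)
{
    assert(s != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++) {
        *s = hdr_step(*s, buf[i]);
        if (*s == S_DONE)
            return (long)(i + 1);
    }
    return -1;
}
```

In the recv() loop this would replace both `header_found` and `consec_success`: while the state is not `S_DONE`, call `hdr_scan(&state, buf, numbytes)` and, when it returns `off >= 0`, write `numbytes - off` bytes starting at `buf + off`; once the state is `S_DONE`, write each chunk whole.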