
I want to know: is there a way to find out where in the response stream the header ends?

The background of the question is as follows: I am using sockets in C to get content from a website, and the content is gzip-encoded. I would like to read the content directly from the stream and decode the gzip content with zlib. But how do I know where the HTTP header ends and the gzip content starts?

I tried two approaches, and both give me results that seem strange to me. In the first, I read in the whole stream and print it to the terminal; my HTTP header ends with "\r\n\r\n", as I expected. In the second, I call recv() once to get the header and then read the content in a while loop; here the header ends without "\r\n\r\n".

Why? And which is the right way to read in the content?

Here is the code, so you can see how I'm getting the response from the server.

//first way (gives rnrn)
char *output, *output_header, *output_content, **output_result;
size_t size;
FILE *stream;
stream = open_memstream (&output, &size);
char BUF[BUFSIZ];
while(recv(socket_desc, BUF, (BUFSIZ - 1), 0) > 0)
{
    fprintf (stream, "%s", BUF);
}
fflush(stream);
fclose(stream);

output_result = str_split(output, "\r\n\r\n");
output_header = output_result[0];
output_content = output_result[1];

printf("Header:\n%s\n", output_header);
printf("Content:\n%s\n", output_content);


//second way (doesnt give rnrn)
char *content, *output_header;
size_t size;
FILE *stream;
stream = open_memstream (&content, &size);
char BUF[BUFSIZ];

if(recv(socket_desc, BUF, (BUFSIZ - 1), 0) > 0)
{
    output_header = BUF;
}

while(recv(socket_desc, BUF, (BUFSIZ - 1), 0) > 0)
{
    fprintf (stream, "%s", BUF); //i would just use this as input stream to zlib
}
fflush(stream);
fclose(stream);

printf("Header:\n%s\n", output_header);
printf("Content:\n%s\n", content);

Both print the same result to the terminal, but I would expect the second one to print some more line breaks, because in the first one they get lost when the string is split.

I am new to C, so I might just be overlooking something simple.

Remy Lebeau
Iwan1993
  • The header always ends with an empty line. This is defined by the RFC. The content can be sent either with a content length or with chunked encoding, which makes a lot of difference in processing. – Valeri Atamaniouk Apr 26 '13 at 18:57
  • Can you explain to me why the second one works? I mean, I can change the buffer size, but I will still get only the header. Is there any special behavior on the server side? – Iwan1993 Apr 26 '13 at 19:06
  • It would appear that your second example might work some of the time by sheer accident and luck. You are simply reading from the socket twice in a row. By accident of timing it might be that the first `recv()` gets the header and the second one gets the body. But you can't count on that. You have to parse the data that you receive and look for the blank line. – Celada Apr 26 '13 at 19:15
  • In chunked form the content is transferred in pieces. First, you need to check what the 'transfer-encoding' header value is (yes, case-insensitive). If it is "chunked", you are in trouble. Instead of a simple content length, you will get a string with the length of the next piece, then the data, then the next length, and so on. If the length is 0, this is the last part, and after it there could be additional headers... If you need details, you need to check RFC2616. – Valeri Atamaniouk Apr 26 '13 at 19:17

1 Answer


You are calling recv() in a loop until the socket disconnects or fails, storing all of the raw data into your char* buffer (and writing the received data to your stream the wrong way: recv() does not null-terminate the buffer, and gzip data can contain zero bytes, so printing it with "%s" is not reliable). That is not the correct way to read an HTTP response, especially if HTTP keep-alives are used (in which case no disconnect will occur at the end of the response). You must follow the rules outlined in RFC 2616. Namely:

  1. Read until the "\r\n\r\n" sequence is encountered. This terminates the response headers. Do not read any more bytes past that yet.

  2. Analyze the received headers, per the rules in RFC 2616 Section 4.4. They tell you the actual format of the remaining response data.

  3. Read the remaining data, if any, per the format discovered in #2.

  4. Check the received headers for the presence of a Connection: close header if the response is using HTTP 1.1, or the lack of a Connection: keep-alive header if the response is using HTTP 0.9 or 1.0. If detected, close your end of the socket connection because the server is closing its end. Otherwise, keep the connection open and re-use it for subsequent requests (unless you are done using the connection, in which case do close it).

  5. Process the received data as needed.
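
As a rough illustration of step 1, the search for the "\r\n\r\n" terminator might look like the C sketch below. The function name `find_header_end` is hypothetical (it is not part of any library), and the sketch assumes the caller accumulates the bytes returned by recv() in a buffer and tracks how many of them are valid, since the data is not null-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first body byte
 * (just past "\r\n\r\n"), or 0 if the terminator has not arrived yet.
 * buf does not need to be null-terminated; len is the number of
 * valid bytes received so far. */
static size_t find_header_end(const char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    }
    return 0;
}
```

One possible design is to call this after every recv() and, once it returns a non-zero offset, treat any bytes past that offset as the beginning of the body instead of discarding them. The alternative is to read one byte at a time until the terminator is seen, which is simpler but slower.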

In short, you need to do something more like this instead (pseudo code):

string headers[];
byte data[];

string statusLine = read a CRLF-delimited line;
int statusCode = extract from status line;
string responseVersion = extract from status line;

do
{
    string header = read a CRLF-delimited line;
    if (header == "") break;
    add header to headers list;
}
while (true);

if ( !((statusCode in [1xx, 204, 304]) || (request was "HEAD")) )
{
    if (headers["Transfer-Encoding"] ends with "chunked")
    {
        do
        {
            string chunk = read a CRLF delimited line;
            int chunkSize = extract from chunk line;
            if (chunkSize == 0) break;

            read exactly chunkSize number of bytes into data storage;

            read and discard until a CRLF has been read;
        }
        while (true);

        do
        {
            string header = read a CRLF-delimited line;
            if (header == "") break;
            add header to headers list;
        }
        while (true);
    }
    else if (headers["Content-Length"] is present)
    {
        read exactly Content-Length number of bytes into data storage;
    }
    else if (headers["Content-Type"] begins with "multipart/")
    {
        string boundary = extract from Content-Type header;
        read into data storage until terminating boundary has been read;
    }
    else
    {
        read bytes into data storage until disconnected;
    }
}

if (!disconnected)
{
    if (responseVersion == "HTTP/1.1")
    {
        if (headers["Connection"] == "close")
            close connection;
    }
    else
    {
        if (headers["Connection"] != "keep-alive")
            close connection;
    }
}

check statusCode for errors;
process data contents, per info in headers list;
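
The "extract from chunk line" step in the chunked branch could be sketched in C along these lines. The name `parse_chunk_size` is hypothetical, and the sketch assumes the line has already been read with its trailing CRLF removed. The chunk size is a hexadecimal number, optionally followed by a ";" and chunk extensions, which are ignored here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parses a chunk-size line such as "1a3f" or
 * "1a3f;name=value" (CRLF already stripped). Stores the size in *out
 * and returns 0 on success; returns -1 if the line does not start
 * with hex digits, has unexpected trailing characters, or overflows. */
static int parse_chunk_size(const char *line, size_t *out)
{
    assert(line != NULL && out != NULL);
    size_t n = 0;
    const char *p = line;

    while (isxdigit((unsigned char)*p)) {
        int d = isdigit((unsigned char)*p)
                    ? *p - '0'
                    : tolower((unsigned char)*p) - 'a' + 10;
        if (n > (SIZE_MAX - (size_t)d) / 16)
            return -1;              /* value would not fit in size_t */
        n = n * 16 + (size_t)d;
        p++;
    }
    if (p == line)
        return -1;                  /* no hex digits at all */
    /* only a chunk extension or whitespace may follow the digits */
    if (*p != '\0' && *p != ';' && *p != ' ' && *p != '\t')
        return -1;

    *out = n;
    return 0;
}
```

A size of 0 marks the last chunk, after which the optional trailing headers are read as in the pseudo code above.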
Remy Lebeau