Creating HTTP client to download a webpage for offline viewing in C

Question

I am creating a HTTP client which downloads a web page based on a command line argument. It takes the argument, looks up the domain name to get the IP address, creates a socket, connects to the server and sends a GET request and waits for a reply. This all works fine however when I read my reply in using a buffer and a while loop I also receive some unreadable characters. If you run the code and view the html you will see unreadable characters here and there on the page.

My Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(int argc, char *argv[])
{
    int socket_desc, i, bytes_read;    
    char server_reply[1024], ip[100], request[100];;
    char *hostname = argv[1];
    struct sockaddr_in server;
    struct hostent *he;
    struct in_addr **addr_list;
    FILE *fp;

    if ((he = gethostbyname(hostname)) == NULL) {
        //gethostbyname failed
        herror("gethostbyname\n");
        return 1;
    }

    addr_list = (struct in_addr **) he->h_addr_list;

    for(i = 0; addr_list[i] != NULL; i++) {
        //Return the first one;
        strcpy(ip , inet_ntoa(*addr_list[i]) );
    }

    //Create socket
    socket_desc = socket(AF_INET, SOCK_STREAM, 0);
    if (socket_desc == -1) {
        printf("Could not create socket!\n");
    }

    server.sin_addr.s_addr = inet_addr(ip);
    server.sin_family = AF_INET;
    server.sin_port = htons(80);

    //Connect to remote server
    if (connect(socket_desc , (struct sockaddr *)&server , sizeof(server)) < 0) {
        printf("connect error!\n");
        return 1;
    }

    printf("Connected...\n");

    //Send some data
    snprintf(request, 99, "GET / HTTP/1.1\r\n"
            "Host: %s\r\n"
            "\r\n\r\n", hostname
    );

    if (send(socket_desc, request, strlen(request), 0) < 0) {
        puts("Send failed!\n");
        return 1;
    }
    puts("Data Sent...\n");

    //Receive a reply from the server

    fp = fopen("/home/localusr/Desktop/ouput.html", "w+");

    while (bytes_read = read(socket_desc, server_reply, sizeof(server_reply)) > 0) {
        fputs(server_reply, fp);
        memset(server_reply, 0, sizeof(server_reply));
    } 
    do {
        bytes_read = read(socket_desc, server_reply, sizeof(server_reply));
        fputs(server_reply, fp);
        memset(server_reply, 0, sizeof(server_reply));
    } while (bytes_read > 0);

    printf("reply received...\n");

    fclose(fp);
    close(socket_desc);

    return 0;
}

Sorry about the poorly indented code. Any help much appreciated. I am using an Ubuntu machine and using gcc to compile my code.

EDIT:

orb.ws.require.lib--> <script type="text/javascript">/*
be2

the be2 should not be there. * also get '@' symbols

Please don't try to do HTTP yourself without reading the necessary documentation. You are doing several things wrong here which indicate that you have no idea of proper HTTP. — Steffen Ullrich, Dec 11 '14 at 11:33
I have read RFC 2616 and my GET request matches the standard and there is no problem with that. The server replies with 200 OK and gives me the html. Its just my buffer that keeps adding characters that its not read. Your comment was incredibly unhelpful and actually showed you have little knowledge of a GET request. I cant use wget because I’m trying to make the program myself with C and without using to many libraries that will do it for me. — William Goodwin, Dec 11 '14 at 11:40
Use an HTTP client library like [libcurl](http://curl.haxx.se/libcurl/); the [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) protocol is complex enough; don't spend months or years of work to reimplement it. — Basile Starynkevitch, Dec 27 '14 at 10:26
@WilliamGoodwin: Steffen's comment was spot on. You DO NOT have a good understanding of how HTTP works, and your code reflects that. Your request has 1 too many CRLFs in it, and a reading loop to blindly read inbound data until the read fails is the **completely wrong way** to read the server's response. You apparently have missed [RFC 2616 Section 4.4](http://tools.ietf.org/html/rfc2616#section-4.4). You **MUST** analyze the response headers to know how to read the response body correctly. See http://stackoverflow.com/a/19211701/65863 for an example of the logic you **MUST** implement. — Remy Lebeau, Jan 05 '15 at 00:35

Morpfh · Answer 1 · 2015-01-04T23:21:58.247

Edit: To put my comment up here:

Note that on e.g. www.bbc.co.uk the response header say “Transfer-Encoding: chunked” which means every chunk has a hex digit for length followed by the data followed by \r\n.

That is, per your example:

be2\r\n => 0xbe2\r\n => 3042\r\n

or “Here follows 3042 bytes” (after \r\n aka CRLF or hex 0d0a).

Example of chunk:

e\r\nStack Exchange
|  | ||||||||||||||
|  | +............+
|  |        |
|  |        +-------- 14 bytes
|  +----------------- \r\n
+-------------------- 0x0e == 14 dec in hex

Old:

Instead of memset etc you can properly terminate read bytes by:

while ( (bytes_read = read(socket_desc, server_reply, sizeof(server_reply) - 1)) > 0) {
    server_reply[bytes_read] = 0x00;

Nothing beyond bytes_read will be fputs'ed after this.

As you memset entire buffer to 0 but also read into entire buffer - the memset has no effect unless read is less then buffer size. You simply overwrite all the zeroes on full (1024) reads and and write 1024 + garbage until first zero.

read() returns number of bytes read. By setting server_reply[bytes_read] to 0 you do in effect terminate the actual data. Make it into a C-string. Without setting last byte to zero fputs() will keep outputting garbage after bytes_read until first zero or crash.

Put another way; read() reads up to size bytes, not caring if it is all zero bytes. If you tell read() to read 356GiB of data and the file descriptor delivers 356GiB of zeros (as in 0x00 bytes, not ASCII 0) – that is what you get.

Your socket does not terminate delivery with zero. It deliver zero bytes as part of data as do your server. Say you transfered an image or some other data with zero bytes; In other words: it is not a zero terminated string read() gets.

Also note the - 1 after sizeof – to make room for the null byte.

fputs however writes until first terminating null byte, but does not include it in output (which is what you normally want if you are writing buffered stringdata).

Example:

char buf[8];

Char is uninitialized and contains garbage. For example it could be:

buf[0] == 0x13
buf[1] == 0x0a
buf[2] == 0x00
buf[3] == 0x65
buf[4] == 0x78
buf[5] == 0xf3
buf[6] == 0x00
buf[7] == 0xaf

beyond buf you have random garbage, for example

buf[7+1] == 0xde
buf[7+2] == 0xa0
buf[7+3] == 0x33
buf[7+3] == 0x00

bytes_read = read(soc, buf, 8); soc delivers: 'ABCDEFG'

Buffer is now:

buf[0] == 0x41 (A)
buf[1] == 0x42 (B)
buf[2] == 0x43 (C)
buf[3] == 0x44 (D)
buf[4] == 0x45 (E)
buf[5] == 0x46 (F)
buf[6] == 0x47 (G)
buf[7] == 0xaf (H)

But the bytes beyond buf[7] is still filled with garbage; and your fputs() is going to read and pass on the data to the file until first zero.

That is why you instead say:

bytes_read = read(soc, buf, 7);
buf[bytes_read] = 0x00;

Now we only read A-G. Last byte is set to 0.

Here fputs(buf, fh) is ging to write until first \0, in other words ABCDEFG.

If server now on next run-around delivers, say, only two bytes:

buf[0] == 0x48 (H)
buf[1] == 0x5A (Z)

Then bytes_read will be 2 and the statement:

buf[bytes_read] = 0x00 ===> buf[2] = 0x00

which gives you

buf[0] == 0x48 (H)
buf[1] == 0x5A (Z)
buf[2] == 0x00 (0x00) <<--- nulled out
                      +---.
buf[3] == 0x44 (D)    |    \
buf[4] == 0x45 (E)    |     \
buf[5] == 0x46 (F)    |      }--->>> garbage from previous read.
buf[6] == 0x47 (G)    |     / 
buf[7] == 0x00 (0x00) |    /
                      +---/

Here fputs(buf, fh) is ging to write until first \0, in other words HZ.

the reply will vary in size dependant on which web page’s html is requested. I am currently testing it on www.bbc.co.uk but www.macmillandictionary.com has much less html. Most web pages though have got more than 1024 bytes of html so that’s why I'm attempting to clear the buffer and go back to where it was full and read that in. this works but the end of each buffer is where the random characters are placed and I cant just delete a set number of characters off the end of the buffer because there not uniform. When I use your code to properly terminate read_bytes I don’t read anything. — William Goodwin, Dec 11 '14 at 12:36

Ishay Peled · Answer 2 · 2014-12-11T15:08:44.603

0

Did you try accessing the web page with telnet?

Please do the following:

telnet [hostname] [port]

And in the telnet shell type:

GET / HTTP/1.1
Host: [hostname]
<return>

(mind the extra return after Host!

Please post both the results from telnet and the results from your code

EDIT *

Found the problem:

You use fputs instead of fwrite. fputs expects a string, which it detects by looking for a NULL character.

In your case however, such NULL character is not promised, so you have to be explicit. As a bonus, your program now terminates and flushes input to file. The fix:

Replace your while and do while loops with the following do while loop:

do
{
    int write;
    bytes_read = read(socket_desc, server_reply, sizeof(server_reply));
    write = fwrite(server_reply, 1, bytes_read, fp);
    printf("Written %d bytes_read: %d\n", write, bytes_read);
    memset(server_reply, 0, sizeof(server_reply));
    fflush(fp);
} while (bytes_read > 0); // This termination is wrong! You should look at Content-Length from the server's reply to detect the actual length

It works now....

edited Dec 11 '14 at 15:08

answered Dec 11 '14 at 11:32

Ishay Peled

2,783
1
23
37

The html I get back is large but with the telnet I don’t get any random characters whereas with my buffer reading in the html I do. I have edited my question to give a short example of what the html with my buffer looks like. – William Goodwin Dec 11 '14 at 11:44
I ran your code, fixed it up a little bit by adding fflush and removing the other do { } while loop. Your program doesn't terminate, but the output looks fine... Which website are you trying to access, and what exactly is the output you get? Add the complete output here (try a site with little content) and I will compare it to what I'm getting. The HTML protocol is indeed somewhat complicated, but for your needs, you're doing it right. – Ishay Peled Dec 11 '14 at 14:25
I tried your code and it compiles and runs but it doesn’t terminate and doesn’t read enough in to get all of the html. I haven’t looked in to fflush so I might have a look at how fflush works.. – William Goodwin Dec 11 '14 at 14:50
Modified my answer, see if its complete now – Ishay Peled Dec 11 '14 at 14:59
@Morpfh - You are correct in the general case, however, note that these lines from his while loop: fputs(server_reply, fp); memset(server_reply, 0, sizeof(server_reply)); This means the entire buffer is zeroed out so the case your describe is not valid here – Ishay Peled Dec 11 '14 at 15:16
@IshayPeled: Yes it is. He zeroes out entire buffer but also read into entire buffer. The zeroing has no effect unless read is less then buffer size. For firs run around there is no zeroing at all and as such it will fail each time unless by chance there should be a zero at the given end of read. – Morpfh Dec 11 '14 at 15:27
The `memset` is pointless unless read should happen to be less then 1024 bytes. – Morpfh Dec 11 '14 at 15:27

4pie0 · Answer 3 · 2014-12-27T09:05:48.520

0

read() doesn't null terminate bytes. But fputs() depends on null termination, so you have to append 0x00 at the end of char array if it is to be passed to fputs().

edited Dec 27 '14 at 09:05

answered Dec 11 '14 at 15:10

4pie0

29,204
9
82
118

Creating HTTP client to download a webpage for offline viewing in C

3 Answers3

Example: