
I'm writing an HTTP client that takes the URL of a file, downloads it, and saves it to disk, like curl does. I can use only C/C++ with std:: and libc. Text files such as XML, CSV, or txt download fine: they are saved correctly and show the expected text when opened in an editor. But when I download a tar or PDF file and try to open it, I'm told the file is corrupted.

Here are the two main methods of my HttpClient class. HttpClient::get sends the HTTP request to the host named in the URL and then calls the second method, HttpClient::receive, which determines whether the data is binary or text and writes the whole HTTP response body to a file in binary or text mode. I've left out the other methods, but I can post them if anyone needs them.

HttpClient::get:

bool HttpClient::get() {
    std::string protocol = getProtocol();
    if (protocol != "http://") {
        std::cerr << "Don't support no HTTP protocol" << std::endl;
        return false;
    }
    std::string host_name = getHost();

    std::string request = "GET ";
    request += url + " HTTP/" + HTTP_VERSION + "\r\n";
    request += "Host: " + host_name + "\r\n";
    request += "Accept-Encoding: gzip\r\n";
    request += "Connection: close\r\n";
    request += "\r\n";

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) {
        std::cerr << "Can't create socket" << std::endl;
        return false;
    }
    addr.sin_family = AF_INET;
    addr.sin_port = htons(HTTP_PORT);

    raw_host = gethostbyname(host_name.c_str());
    if (raw_host == NULL) {
        std::cerr << "No such host: " << host_name << std::endl;
        return false;
    }

    if (!this->connect()) {
        std::cerr << "Can't connect" << std::endl;
        return false;
    } else {
        std::cout << "Connection established" << std::endl;
    }

    if (!sendAll(request)) {
        std::cerr << "Error while sending HTTP request" << std::endl;
        return false;
    }

    if (!receive()) {
        std::cerr << "Error while receiving HTTP response" << std::endl;
        return false;
    }

    close(sock);
    return true;
}

HttpClient::receive:

bool HttpClient::receive() {
    char buf[BUF_SIZE];
    std::string response = "";
    std::ofstream file;
    FILE *fd = NULL;

    while (1) {
        size_t bytes_read = recv(sock, buf, BUF_SIZE - 1, 0);

        if (bytes_read < 0)
            return false;

        buf[bytes_read] = '\0';
        if (!file.is_open())
            std::cout << buf;

        if (!file.is_open()) {
            response += buf;
            std::string content = getHeader(response, "Content-Type");

            if (!content.empty()) {
                std::cout << "Content-Type: " << content << std::endl;
                if (content.find("text/") == std::string::npos) {
                    std::cout << "Binary mode" << std::endl;
                    file.open(filename, std::ios::binary);
                }
                else {
                    std::cout << "Text mode" << std::endl;
                    file.open(filename);
                }

                std::string::size_type start_file = response.find("\r\n\r\n");
                file << response.substr(start_file + 4);
            }
        }
        else
            file << buf;
        if (bytes_read == 0) {
            file.close();
            break;
        }
    }
    return true;
}

I couldn't find any help on this. I think the binary data is encoded in some way, but how do I decode it?

klutt
A.Starshov
  • `buf[bytes_read] = '\0';` -- Unless I'm mistaken about how you're reading the file, if the file is binary, why are you artificially sticking a null in the data? That would corrupt the binary data. – PaulMcKenzie Oct 29 '19 at 20:31
  • `response += buf` also won't work if there are nul characters in your binary data, which is very likely to be the case. – Miles Budnek Oct 29 '19 at 20:39
  • Your `receive()` is not properly parsing the HTTP response. It is just blindly reading arbitrary chunks of data until disconnected, trying to parse as it goes. You need to read the HTTP headers until you reach the terminating `\r\n\r\n`, THEN parse the headers to know the transmission format of the body, THEN read the body accordingly. See my answers to [Receiving only necessary data with C++ Socket](https://stackoverflow.com/questions/14421008/) and [When is an HTTP response finished?](https://stackoverflow.com/questions/19199066/) for pseudo code on reading an HTTP response properly. – Remy Lebeau Oct 29 '19 at 20:45
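The first step described in the last comment (collect bytes until the `\r\n\r\n` terminator, and only then look at the headers) could be sketched roughly as follows. `SplitResponse` and `splitResponse` are hypothetical names that do not appear in the question, and the sketch assumes the caller accumulates received bytes with `append(buf, n)` so that embedded `'\0'` bytes survive.

```cpp
#include <string>

// Hypothetical helper (not part of the question's HttpClient): splits the
// bytes received so far into the header block and the start of the body.
struct SplitResponse {
    bool complete;        // true once the blank line ending the headers was seen
    std::string headers;  // status line and header lines, without the blank line
    std::string body;     // body bytes received so far; may contain '\0'
};

SplitResponse splitResponse(const std::string& raw) {
    SplitResponse out{false, "", ""};
    const std::string::size_type pos = raw.find("\r\n\r\n");
    if (pos == std::string::npos)
        return out;  // headers incomplete: the caller should recv() more data
    out.complete = true;
    out.headers = raw.substr(0, pos);
    out.body = raw.substr(pos + 4);  // everything after the blank line
    return out;
}
```

A receive loop would call this after every `recv()` until `complete` is true, then parse `headers` once to decide how the rest of the body must be read.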

2 Answers


I couldn't find any help on this. I think the binary data is encoded in some way, but how do I decode it?

You don't explain why you think so, but the following line from your request might trigger an encoding you don't handle:

request += "Accept-Encoding: gzip\r\n";

Here you explicitly say that you are willing to accept content encoded (compressed) with gzip. But judging by your code, you never check the Content-Encoding header to see whether the content is actually declared as gzip-encoded.
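As a rough illustration of that check, here is a minimal sketch of a case-insensitive header lookup. `headerValue` and `isGzipEncoded` are hypothetical helpers (the question's own `getHeader` is not shown, so this does not claim to match it), and the input is assumed to be the header block only, with lines separated by `\r\n`. Since only std:: and libc are allowed and neither can decompress gzip, the simpler route is probably to stop sending `Accept-Encoding: gzip` and use this check just to reject a response that is compressed anyway.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-cases an ASCII string; header names are case-insensitive.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper: returns the trimmed value of the first header called
// `name`, or an empty string if no such header exists.
std::string headerValue(const std::string& headers, const std::string& name) {
    const std::string wanted = toLower(name);
    std::string::size_type lineStart = 0;
    while (lineStart < headers.size()) {
        std::string::size_type lineEnd = headers.find("\r\n", lineStart);
        if (lineEnd == std::string::npos)
            lineEnd = headers.size();
        const std::string line = headers.substr(lineStart, lineEnd - lineStart);
        const std::string::size_type colon = line.find(':');
        if (colon != std::string::npos && toLower(line.substr(0, colon)) == wanted) {
            const std::string value = line.substr(colon + 1);
            const std::string::size_type first = value.find_first_not_of(" \t");
            if (first == std::string::npos)
                return "";
            const std::string::size_type last = value.find_last_not_of(" \t");
            return value.substr(first, last - first + 1);
        }
        lineStart = lineEnd + 2;  // skip past "\r\n"
    }
    return "";
}

// True if the response declares a gzip content encoding.
bool isGzipEncoded(const std::string& headers) {
    return toLower(headerValue(headers, "Content-Encoding")).find("gzip") != std::string::npos;
}
```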

Apart from this, the following line might cause a problem too:

request += url + " HTTP/" + HTTP_VERSION + "\r\n";

You don't show what HTTP_VERSION is, but assuming it is 1.1, you also have to deal with Transfer-Encoding: chunked.
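To make the chunked format concrete, here is a minimal sketch of a decoder. `decodeChunked` is a hypothetical helper, and it assumes the whole chunked body has already been received into one `std::string` (which works here because the request sends `Connection: close`). Each chunk is a hexadecimal size line, that many bytes of data, and a trailing `\r\n`; a chunk of size 0 ends the body. Chunk extensions and trailers are ignored, and very large chunk sizes are not guarded against overflow.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a complete "Transfer-Encoding: chunked" body.
// Returns false if the input is truncated or malformed.
bool decodeChunked(const std::string& in, std::string& out) {
    out.clear();
    std::string::size_type pos = 0;
    while (true) {
        const std::string::size_type lineEnd = in.find("\r\n", pos);
        if (lineEnd == std::string::npos)
            return false;  // size line is incomplete

        // Parse the hexadecimal chunk size; stop at ';' (chunk extension).
        std::size_t size = 0;
        bool anyDigit = false;
        for (std::string::size_type i = pos; i < lineEnd; ++i) {
            const char c = in[i];
            int digit = 0;
            if (c >= '0' && c <= '9') digit = c - '0';
            else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
            else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
            else break;
            size = size * 16 + static_cast<std::size_t>(digit);
            anyDigit = true;
        }
        if (!anyDigit)
            return false;

        pos = lineEnd + 2;  // first byte of the chunk data
        if (size == 0)
            return true;    // last chunk; any trailers are ignored
        if (in.size() - pos < size + 2)
            return false;   // chunk data or its trailing CRLF is missing
        out.append(in, pos, size);  // binary-safe copy of exactly `size` bytes
        if (in.compare(pos + size, 2, "\r\n") != 0)
            return false;
        pos += size + 2;
    }
}
```

The decoded `out` can then be written to the file with `file.write(out.data(), out.size())`.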

Steffen Ullrich

Thanks, everyone. I solved this problem by changing response += buf; to response.append(buf, bytes_read); and file << buf; to file.write(buf, bytes_read);. It was stupid to treat binary data as a null-terminated string.

A.Starshov
  • Even with those fixes in place, your overall implementation is still wrong for other reasons, as I mentioned in another comment. – Remy Lebeau Oct 30 '19 at 02:53