Downloading UTF-8 file with libcurl (ANSI works fine)

Question

I am writing a simple file downloader with the help of libcurl. Here's the code for downloading a file from an HTTP server:

static size_t WriteCallback(void *contents, size_t size, size_t nmemb, void *userp) {
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}

std::wstring result; //result with polish letters (ą, ę etc.)
CURL *curl;
CURLcode res;
std::string readBuffer;

curl = curl_easy_init();
ERROR_HANDLE(curl, L"CURL could not be initialized.", MOD_INTERNET);
curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 0L);
curl_easy_setopt(curl, CURLOPT_SSL_VERIFYHOST, 0L);
curl_easy_setopt(curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_easy_setopt(curl, CURLOPT_USERPWD, (login + ":" + password).c_str()); //e.g.: "login:password"
curl_easy_setopt(curl, CURLOPT_POST, 1L); //libcurl expects a long here, not a bool
//curl_easy_setopt(curl, CURLOPT_ENCODING, "UTF-8"); //does not change anything
res = curl_easy_perform(curl);
curl_easy_cleanup(curl);

result = C::toWString(readBuffer);
return res == 0; //0 = OK

It works fine when the file I want to download is encoded as ANSI (according to e.g. Notepad++). But when I try to download a UTF-8 file (UTF-8 without BOM), some characters (e.g. Polish letters) come out wrong because of an encoding problem.

For example, I ran the code on two files containing the same text ("to jest teść to") and saved each result to a std::wstring. result comes from the ANSI file and result2 (the problematic one) comes from the UTF-8 version.

Both files, when opened on the server with e.g. Notepad++, display the right text.

So, how can I get the UTF-8 file content with libcurl and save it to a std::wstring with the proper encoding (so that the Visual Studio debugger shows it as to jest teść to)?

Storing UTF-8 in a wide string doesn't make a whole lot of sense. What's the point of doing that? – MrEricSir Oct 20 '15 at 21:03
The code is not storing UTF-8 in a `std::wstring`. It is storing the UTF-8 in a `std::string` and then converting that to `std::wstring` after the download is finished. The problem is in the conversion, not the download itself. – Remy Lebeau Oct 20 '15 at 21:05
@MrEricSir As far as I understand (correct me if I'm wrong), a wstring can store wide characters and will work well with UTF-8 (which uses more than 1 byte to store the characters of my Polish text). The debugger shows that as well. Storing the text inside a string is unclear to me (and methods like find etc. won't work as they should). – PolGraphic Oct 20 '15 at 22:15
@PolGraphic No, as the name suggests, UTF-8 is intended to be stored in 8-bit characters. UTF-16 expects to be stored in 16-bit characters. On Windows wchar_t is a 16-bit character, so you'd store UTF-16 strings in a wstring. – MrEricSir Oct 21 '15 at 01:48
@MrEricSir I am no expert in that field, but I know I had a **lot** of problems with UTF-8/Polish characters in `std::string` (the find method, `char a = 'ą'`, and some Windows functions that want `wchar_t`, just to name a few). After reading a few questions on `string vs wstring` (e.g. http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring, which says "for Windows almost **always** use wstring") I switched to `wstring` and I'm happy with that. – PolGraphic Oct 21 '15 at 05:36

score 2 · Accepted Answer · answered Oct 20 '15 at 21:08

This is not a libcurl issue. You are storing the raw data in a std::string and then converting that to a std::wstring after the download is finished. You have to look at the charset reported in the HTTP response and decode the data to std::wstring accordingly. C::toWString() has no concept of charsets, so you should use something else, like iconv or ICU. Or, if you know the data is always UTF-8, do the conversion manually (UTF conversions are easy to code by hand), or use C++11's built-in UTF conversions via the std::wstring_convert class.
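
To illustrate the "by hand" option, here is a minimal sketch of a UTF-8 decoder that fills a std::wstring. The name `utf8ToWide` is hypothetical (it is not part of libcurl or of the asker's `C` class), and the code assumes the input really is UTF-8. It emits UTF-16 surrogate pairs where `wchar_t` is 16 bits wide (as on Windows) and plain code points otherwise. A hand-written version is shown partly because std::wstring_convert has been deprecated since C++17.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into std::wstring.
// Throws std::runtime_error when the input is not well-formed UTF-8.
std::wstring utf8ToWide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong encodings, surrogate code points and values past U+10FFFF.
        static const std::uint32_t minForLength[] = {0, 0x80, 0x800, 0x10000};
        if (cp < minForLength[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: encode as a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

In the question's code this would stand in for the final conversion step, e.g. `result = utf8ToWide(readBuffer);`, but only for responses known to be UTF-8.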

Remy Lebeau


How can I look at the charset reported in the HTTP response to a libcurl request? – PolGraphic Oct 20 '15 at 22:16
The charset is in the `Content-Type` response header, which you can retrieve using [`curl_easy_getinfo()`](http://curl.haxx.se/libcurl/c/curl_easy_getinfo.html) with its `info` parameter set to [`CURLINFO_CONTENT_TYPE`](http://curl.haxx.se/libcurl/c/CURLINFO_CONTENT_TYPE.html). – Remy Lebeau Oct 20 '15 at 23:22
Thank you. For some documents I get e.g. `text/html; charset=ISO-8859-1`, but for many of them I just have `text/plain` (in both `ANSI` and `UTF-8` cases). Can I do something about it? – PolGraphic Oct 22 '15 at 10:16
libcurl reports what the server provides. If no charset is specified, you would have to parse the content (in the case of HTML or XML), or else use the default charset of the reported media type. For `text/plain` delivered over HTTP, that is `ISO-8859-1`, per RFC 2616 Section 3.7.1: "*When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.*" – Remy Lebeau Oct 22 '15 at 16:30
If you are receiving a UTF-8 text file without a `charset=utf-8` declaration in the `Content-Type` header, then the server is not compliant with the HTTP specs. – Remy Lebeau Oct 22 '15 at 16:32
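
The steps discussed in these comments could be sketched roughly as follows: pull the `charset` parameter out of the `Content-Type` value, fall back to the RFC 2616 default for `text/*` when it is missing, and decode ISO-8859-1 when that is the result. All three function names are hypothetical, and the parameter parsing is deliberately simplistic (it looks for the first `charset=` anywhere in the value). The libcurl call appears only in a comment so that the block compiles without libcurl.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Lower-cases an ASCII string (media types and charset names are case-insensitive).
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: returns the lower-cased charset parameter of a
// Content-Type value, or an empty string when there is none.
// With libcurl the input would come from something like:
//   char* ct = nullptr;
//   curl_easy_getinfo(curl, CURLINFO_CONTENT_TYPE, &ct);  // ct may remain null
// called after curl_easy_perform() and before curl_easy_cleanup(), because
// the returned pointer belongs to the easy handle.
std::string charsetFromContentType(const std::string& contentType) {
    const std::string lower = toLower(contentType);
    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return std::string();
    pos += key.size();
    if (pos < lower.size() && lower[pos] == '"') ++pos;  // skip an opening quote
    const std::size_t end = lower.find_first_of("\"; \t", pos);
    return lower.substr(pos, end == std::string::npos ? std::string::npos : end - pos);
}

// Hypothetical helper: applies the RFC 2616 default quoted above, i.e.
// ISO-8859-1 for text/* media types that carry no explicit charset.
std::string effectiveCharset(const std::string& contentType) {
    const std::string cs = charsetFromContentType(contentType);
    if (!cs.empty()) return cs;
    const bool isText = toLower(contentType).compare(0, 5, "text/") == 0;
    return isText ? "iso-8859-1" : std::string();
}

// Hypothetical helper: decodes ISO-8859-1 into std::wstring. Every
// ISO-8859-1 byte value equals its Unicode code point, so widening suffices.
std::wstring latin1ToWide(const std::string& in) {
    std::wstring out;
    out.reserve(in.size());
    for (char ch : in) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(ch)));
    }
    return out;
}
```

Note that ISO-8859-1 has no ą, ę, ś or ć, so an "ANSI" file that contains Polish letters is presumably in some other code page (Windows-1250 would be a guess) and would need a table-based converter such as iconv or ICU rather than this fallback.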

score 1 · Answer 2 · answered Oct 20 '15 at 20:38

libcurl won't convert or translate the contents for you. It will deliver the exact bytes to your application that the server sent out.

You can use HTTP Accept headers and the like to influence what the server sends, but you still need to check the received charset and convert the data yourself if it isn't in the encoding you want.
