Curl replacing \u in response to \\u in c++

Question

I am sending a request using libcurl on Windows, and the response I get contains some universal characters that start with \u. Libcurl does not seem to recognize these characters and, as a result, escapes the \, turning \u into \\u. Is there any way to fix this? I have tried using str.replace, but it cannot replace escape sequences. The code I used is:

#include <iostream>
#include <string>
#include <cpr/cpr.h>

int main()
{
    auto r = cpr::Get(cpr::Url{"http://prayer.osamaanees.repl.co/api"});
    std::string data = r.text;
    std::cout << data << std::endl;
    return 0;
}

This code uses the cpr library which is a wrapper for curl. It prints out the following:

{
"times":{"Fajr":"04:58 AM","Sunrise":"06:16 AM","Dhuhr":"12:30 PM","Asr":"04:58 PM","Maghrib":"06:43 PM","Isha":"08:00 PM"},
"date":"Tuesday, 20 Mu\u1e25arram 1442AH"
}

Notice the word Mu\u1e25arram. It should have been Muḥarram, but since curl escaped the \ before the u, it prints out as \u1e25.

How did you check this extra escaping? In a debugger? Because debuggers tend to show C-style representations of data (they also turn a 0x0D byte into `\r`, for example) — Botje, Sep 08 '20 at 14:57
If you print the string with `cout << data` there will be no double backslash. You were just confusing the debugger's representation for the actual memory contents. — Botje, Sep 08 '20 at 14:59
I know it is doing this because I can't see the universal character when I use std::cout. It shows up as \u1e25, which should not be the case. — Osama Anees, Sep 08 '20 at 15:01
Try looking at https://groups.google.com/g/spray-user/c/4XCwzVeNyB0?pli=1 or https://stackoverflow.com/questions/8795702/how-to-convert-uxxxx-unicode-to-utf-8-using-console-tools-in-nix They may give you an idea of how to deal with unicode in curl, although I think they are specific to unix-like systems. — Steven W. Klassen, Sep 08 '20 at 15:02
@StevenW.Klassen those answers are specific to unix systems, in fact the same code works flawlessly in linux, the problem only arises when trying to compile in windows — Osama Anees, Sep 08 '20 at 15:06
@OsamaAnees I am pretty sure that escaping the \ after you received the data won't help. It's more a problem about the encoding the server uses to send the data. I don't know how to check for single Unicode characters in the string and replace them with something printable from the correct codepage. Maybe anyone else does here. Ty, for editing BTW. — πάντα ῥεῖ, Sep 08 '20 at 16:37
@πάνταῥεῖ Hopefully. This has been bugging me for days, lol. — Osama Anees, Sep 08 '20 at 16:41
@OsamaAnees Note that editing your question always bumps it up at the home and [active](https://stackoverflow.com/questions/tagged/c%2b%2b?tab=Active) page. This will help you to get more attention again. But don't abuse that to edit repeatedly please. — πάντα ῥεῖ, Sep 08 '20 at 16:45
How are you so sure this is curl's doing? What are you comparing against? Are you sure this isn't what the server returns? Representing non-ASCII characters as a Unicode escape `\uXXXX` is part of the JSON spec. And `cout` will not interpret JSON escape codes by itself, it just prints what is in the string. — Botje, Sep 08 '20 at 16:54
@OsamaAnees Maybe this helps what you need for replacement: https://stackoverflow.com/questions/12015571/how-to-print-unicode-character-in-c — πάντα ῥεῖ, Sep 08 '20 at 17:03
@Botje I never said I am sure... However I compiled the same code in linux using gcc and it worked flawlessly. — Osama Anees, Sep 08 '20 at 18:11

Remy Lebeau · Accepted Answer · 2020-09-08T18:46:24.533

score 4

Your analysis is wrong. Libcurl is not escaping anything. Load the URL in a web browser of your choosing and look at the raw data that is actually being sent. For example, this is what I see in Firefox:

The server really is sending Mu\u1e25arram, not Muḥarram like you are expecting. And this is perfectly fine, because the server is sending back JSON data, and JSON is allowed to escape Unicode characters like this. Read the JSON spec, particularly Section 9 on how Unicode codepoints may be encoded using hexadecimal escape sequences (which is optional in JSON, but still allowed). \u1e25 is simply the JSON hex-escaped form of ḥ.

You are merely printing out the JSON content as-is, exactly as the server sent it. You are not actually parsing it at all. If you were to use an actual JSON parser, Mu\u1e25arram would be decoded to Muḥarram for you. For example, here is how Firefox parses the JSON:

It is not libcurl's job to decode JSON data. Its job is merely to give you the data that the server sends. It is your job to interpret the data afterwards as needed.
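
To illustrate what a JSON parser does with such an escape, here is a minimal sketch that turns `\uXXXX` sequences into UTF-8 bytes. The function names (`append_utf8`, `read_hex4`, `decode_unicode_escapes`) are made up for this example and do not come from libcurl, cpr, or any JSON library. It only handles `\u` escapes and does not validate lone surrogates, so a real JSON parser is still the right tool for actual data.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Reads four hex digits starting at s[pos]; throws if they are missing or invalid.
unsigned long read_hex4(const std::string& s, std::size_t pos)
{
    if (pos + 4 > s.size()) throw std::invalid_argument("truncated \\u escape");
    unsigned long value = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char c = s[i];
        value <<= 4;
        if (c >= '0' && c <= '9')      value |= static_cast<unsigned long>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<unsigned long>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<unsigned long>(c - 'A' + 10);
        else throw std::invalid_argument("bad hex digit in \\u escape");
    }
    return value;
}

// Replaces every \uXXXX escape with its UTF-8 bytes; all other text is copied unchanged.
std::string decode_unicode_escapes(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == 'u') {
            unsigned long cp = read_hex4(in, i + 2);
            i += 6;
            // A high surrogate followed by a low surrogate encodes one code point above U+FFFF.
            if (cp >= 0xD800 && cp <= 0xDBFF &&
                i + 1 < in.size() && in[i] == '\\' && in[i + 1] == 'u') {
                const unsigned long low = read_hex4(in, i + 2);
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                    i += 6;
                }
            }
            append_utf8(out, cp);
        } else if (in[i] == '\\' && i + 1 < in.size()) {
            // Any other escape (including an escaped backslash) is copied through untouched.
            out += in[i];
            out += in[i + 1];
            i += 2;
        } else {
            out += in[i++];
        }
    }
    return out;
}
```

Calling `decode_unicode_escapes("Mu\\u1e25arram")` returns the UTF-8 bytes for Muḥarram (`E1 B8 A5` in place of the escape). Whether those bytes then display correctly depends on the terminal, not on the decoding.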

edited Sep 08 '20 at 18:46

answered Sep 08 '20 at 18:38

Remy Lebeau


My analysis may have been wrong. I was using this [json](https://github.com/nlohmann/json) parser. It parsed \u characters correctly on Linux but showed weird characters on Windows. At first I thought it might be because of libcurl, but now I guess this might be due to how these operating systems handle strings differently? – Osama Anees Sep 08 '20 at 19:14
@OsamaAnees I would expect any decent JSON parser to handle data the same way regardless of the OS used. Without seeing your actual parsing code, I would assume the problem is likely elsewhere. *Outputting* the data may be different depending on OS (i.e., needing UTF-8 strings vs UTF-16 strings, etc.), but *parsing* shouldn't be. – Remy Lebeau Sep 08 '20 at 20:26
Thank you so much! you made me realize what the actual problem was. _Outputting_. I changed the console output format and it worked by adding just 4 lines of code. – Osama Anees Sep 08 '20 at 21:54

score 0 · Answer 2 · answered Sep 08 '20 at 22:00

I would like to thank Remy for pointing out how wrong I was in thinking curl or the JSON parser was the problem, when in reality I needed to switch my console to UTF-8 mode. Only after I fixed the console code page was I able to get the output I wanted. For future reference, I am adding the code that fixed my problem:

We need to include Windows.h

#include <Windows.h>

Then at the start of our code:

UINT oldcp = GetConsoleOutputCP();
SetConsoleOutputCP(CP_UTF8);

Once we are done printing, we need to reset the console back to the original code page with:

SetConsoleOutputCP(oldcp);
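
The three fragments above can also be wrapped in a small RAII guard, so the original code page is restored even if the function returns early. This is only a sketch: `ConsoleUtf8Guard` and `print_utf8_line` are hypothetical names invented here, and the `#ifdef _WIN32` blocks turn the guard into a no-op on other platforms.

```cpp
#include <iostream>
#include <string>

#ifdef _WIN32
#include <Windows.h>
#endif

// Hypothetical helper: switches the console output code page to UTF-8
// and restores the previous code page when it goes out of scope.
class ConsoleUtf8Guard {
public:
    ConsoleUtf8Guard()
    {
#ifdef _WIN32
        old_cp_ = GetConsoleOutputCP();   // returns 0 on failure
        SetConsoleOutputCP(CP_UTF8);
#endif
    }

    ~ConsoleUtf8Guard()
    {
#ifdef _WIN32
        if (old_cp_ != 0) SetConsoleOutputCP(old_cp_);
#endif
    }

    ConsoleUtf8Guard(const ConsoleUtf8Guard&) = delete;
    ConsoleUtf8Guard& operator=(const ConsoleUtf8Guard&) = delete;

private:
#ifdef _WIN32
    UINT old_cp_ = 0;
#endif
};

// Prints one UTF-8 encoded line while the guard is active.
// Returns false if the stream reported an error.
bool print_utf8_line(const std::string& utf8)
{
    ConsoleUtf8Guard guard;
    std::cout << utf8 << std::endl;
    return static_cast<bool>(std::cout);
}
```

For example, `print_utf8_line(parsed_date)` could be called with the already decoded date string from the JSON parser (`parsed_date` is a placeholder name). The console font still has to contain the glyph for ḥ to be visible.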

The Windows console does not handle UTF-8 very well, even with the code shown (there are *numerous* questions on StackOverflow on this topic!). It would be better to let the JSON parser decode the JSON to UTF-16 strings instead, and then write those UTF-16 strings to the console using Unicode APIs, rather than writing out UTF-8. — Remy Lebeau, Sep 08 '20 at 22:08
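
The UTF-16 route suggested in the comment above might look roughly like the following sketch. It assumes the parser has produced a UTF-8 `std::string` and converts it with `MultiByteToWideChar` before calling `WriteConsoleW`. The helper name `write_utf8_to_console` is hypothetical, and on non-Windows platforms the sketch simply falls back to `std::cout`.

```cpp
#include <cstddef>
#include <iostream>
#include <string>

#ifdef _WIN32
#include <Windows.h>
#endif

// Hypothetical helper: writes UTF-8 text to the console.
// On Windows the text is converted to UTF-16 and written with WriteConsoleW;
// elsewhere it is written to std::cout unchanged.
bool write_utf8_to_console(const std::string& utf8)
{
#ifdef _WIN32
    if (utf8.empty()) return true;

    // First call: ask how many UTF-16 code units the converted text needs.
    const int needed = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                           static_cast<int>(utf8.size()), nullptr, 0);
    if (needed <= 0) return false;

    // Second call: perform the actual conversion into the buffer.
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], needed);

    // WriteConsoleW only succeeds when stdout is a real console,
    // not when output is redirected to a file or pipe.
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wide.size()),
                         &written, nullptr) != 0;
#else
    std::cout << utf8;
    return static_cast<bool>(std::cout);
#endif
}
```

Because `WriteConsoleW` fails on redirected output, a fuller program would probably check the return value and fall back to writing the UTF-8 bytes directly in that case.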
