Convert URL encoding to printable characters

Question

I have to work with strings that contain URL encodings like "%C3%A7", and I need to convert these sequences to the corresponding printable characters. Therefore I wrote a function. It works, but it seems rather awkward. I am an absolute C/C++ beginner. Perhaps someone can point me to a more elegant solution, please.

#include <iostream>
#include <string>
#include <type_traits>

using namespace std;

static inline void substitute_specials(string &str) {
    const struct {string from,to;} substitutions[] { { "20"," " },{ "24","$" },{ "40","@" },{ "26","&" },{ "2C","," },{ "C3%A1","á" },{ "C3%A7","ç" },{ "C3%A9","é" } };
    size_t start_pos = 0;
    while ((start_pos = str.find("%", start_pos)) != string::npos) {
        start_pos++;
        for (size_t i = 0; i < extent<decltype(substitutions)>::value; i++) {
            if (str.compare(start_pos, substitutions[i].from.length(), substitutions[i].from) == 0) {
                // Replace "%" plus the matched code, then continue after the inserted text
                str.replace(start_pos - 1, substitutions[i].from.length() + 1, substitutions[i].to);
                start_pos += substitutions[i].to.length() - 1;
                break;
            }
        }
    }
}

int main() {
    string testString = "This%20is %C3%A1 test %24tring %C5ith %40 lot of spe%C3%A7ial%20charact%C3%A9rs%2C %26 worth many %24%24%24";
    substitute_specials(testString);
    cout << testString << "\n";
    return 0;
}

EDIT 26.12.2016: I am still stuck with this problem. I found some suggestions for libraries and some hand-written functions, but if they run at all, they only decode %xx (a 2-byte hex code in the string), like %20 = space. I haven't found any that handle a 4-byte code like %C3%84 = Ä, and I wasn't able to modify any of them. libcurl's curl_easy_unescape() also asks for 2-byte codes.

Exactly what I need is available in JavaScript: the corresponding functions are encodeURI() / decodeURI(), see http://www.w3schools.com/tags/ref_urlencode.asp The C/C++ source of decodeURI() would probably solve my problem. Line 3829 in https://dxr.mozilla.org/mozilla-central/source/js/src/jsstr.cpp looks like an implementation of that, but I can't extract what I need.

From the other examples I have found: many use sscanf with the %x hex format to convert a 2-byte hex code to an int, and then a static_cast of that int to retrieve the correct char. How can I modify that for 4-byte sequences? The current status of my function is:

wstring url_decode2(char* SRC) {
    wstring ret;
    wchar_t ch;
    int i, ii;
    char sub[5];

    for (i=0; i<strlen(SRC); i++) {
        if (SRC[i]=='%') {
            if ((SRC[i+3]=='%') && (SRC[i+1]>='A')) {
                sub[0]=SRC[i+4];
                sub[1]=SRC[i+5]; // ( also tried lsb/msb )
                sub[2]=SRC[i+1]; // skip +3, it's %
                sub[3]=SRC[i+2];
                sub[4]='\0';
                i=i+5;
            } else {
                sub[0]=SRC[i+1];
                sub[1]=SRC[i+2];
                sub[2]='\0';
                i=i+2;
            }
            sscanf(&sub[0], "%x", &ii);
            ch=static_cast<wchar_t>(ii);
            ret+=ch;
        } else
            ret+=SRC[i];
    }
    return ret;
}

Can anyone help me, please?

These are *not* UTF8 "encodings". They are URL(?) escape sequences. What you see in this page are UTF8 characters. In UTF8 ASCII characters appear the same, non-ASCII characters use 2 or more bytes to store but are displayed as one character. You need a URL decoding method. — Panagiotis Kanavos, Dec 13 '16 at 16:44
BTW [UTF8 literals](http://en.cppreference.com/w/cpp/language/string_literal) need the `u8` prefix, eg `u8"Δx = %"`. Or direct to string `auto testString=u8"Δx = %"s;` or `string testString=u8"Δx = %"s;` — Panagiotis Kanavos, Dec 13 '16 at 16:47
The [MSDN page on String and Character Literals](https://msdn.microsoft.com/en-us/library/69ze775t.aspx) explains how to use UTF8, UTF16 etc in C++ in a very nice way. — Panagiotis Kanavos, Dec 13 '16 at 16:52
What are you trying to do? Decode 8 sequences out of potential infinity? — n. m. could be an AI, Dec 13 '16 at 16:59
@n.m the string is URL encoded, e.g. `%20` is the space character. All the OP needs to do is find a URL decoding library, or write the code by hand. I couldn't find *one* good duplicate question - there are a lot of answers that either use a hand-written method or a library like libcurl's [curl_unescape](https://curl.haxx.se/libcurl/c/curl_unescape.html) — Panagiotis Kanavos, Dec 13 '16 at 17:07
@PanagiotisKanavos I have my own ideas about what OP needs but I want to hear from the OP. Thanks. — n. m. could be an AI, Dec 13 '16 at 17:09
@n.m what *else* could these characters be? They aren't Unicode, that's certain — Panagiotis Kanavos, Dec 13 '16 at 17:10
Unescaped, the data should be `This is á test $tring %C5ith @ lot of speçial charactérs, & worth many $$$` — Panagiotis Kanavos, Dec 13 '16 at 17:12
Possible duplicate of [Encode/Decode URLs in C++](http://stackoverflow.com/questions/154536/encode-decode-urls-in-c) — Panagiotis Kanavos, Dec 13 '16 at 17:16
I found the sequences I want to translate here, in the column 'UTF-8': http://www.utf8-chartable.de/ That's why I thought they were 'UTF-8'. Whatever they are called, I want to translate them to printable characters. My function works fine and the 'unescaped' string you stated is correct. My question was whether I can do this any better. — jamacoe, Dec 13 '16 at 17:45
@n.m. I'd rather not only decode 8 sequences, but all, if there is a function or whatever way to do it. Otherwise I'd have to list each one that may occur. — jamacoe, Dec 13 '16 at 17:47
URL encoding has nothing to do with UTF-8. It encodes bytes. — n. m. could be an AI, Dec 13 '16 at 20:41
It just so happens that bytes of your byte sequences form UTF-8 strings. — n. m. could be an AI, Dec 13 '16 at 20:41
You need to read https://en.wikipedia.org/wiki/Percent-encoding and implement an algorithm that maps three characters "%UV" to one byte 0xUV, where U and V are any hexadecimal digits. — n. m. could be an AI, Dec 13 '16 at 20:49
@n.m. Well, I didn't write "URL", that was edited in by someone. But as it turns out, I may be able to treat my strings like URLs and use some of the code suggested here: https://stackoverflow.com/questions/154536/encode-decode-urls-in-c — jamacoe, Dec 13 '16 at 22:43
@jamacoe the edit was made because these are **NOT** Unicode encodings. They are URL escape sequences. You were asking and looking for the wrong thing, which often leads to downvoting — Panagiotis Kanavos, Dec 14 '16 at 09:16

score 0 · Answer 1 · edited May 23 '17 at 12:01

The answer to my own question is this URI unescape/decode routine, which also handles 2- and 3-byte sequences: https://stackoverflow.com/a/41434414/4335480
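
The linked routine is not reproduced here. As a separate illustration of what "2 and 3 byte sequences" means once the percent escapes have been turned into bytes, the sketch below decodes UTF-8 bytes into Unicode code points. The function name `utf8_to_code_points` is hypothetical, and so are the choices to return a `std::u32string` (rather than the `wstring` tried in the question) and to replace ill-formed input with U+FFFD. Overlong forms and surrogates are not checked.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Assumption: ill-formed bytes become U+FFFD instead of raising an error.
std::u32string utf8_to_code_points(const std::string &bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;  // number of continuation bytes expected
        char32_t cp;        // code point being assembled
        if (lead < 0x80)                { extra = 0; cp = lead; }         // 1-byte (ASCII)
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }  // 2-byte
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }  // 3-byte
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }  // 4-byte
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = i + extra < bytes.size();  // all continuation bytes present?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;    // not of the form 10xxxxxx
            else cp = (cp << 6) | (c & 0x3F);      // append 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this, the bytes 0xC3 0x84 (from `%C3%84`) come out as the single code point U+00C4 (Ä), and 0xC3 0xA7 as U+00E7 (ç). This step is only needed when individual code points are required; for printing on a UTF-8 terminal, the decoded bytes can be written out as they are.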

answered Jan 02 '17 at 23:27 by jamacoe · last edited by Community
