I have to work with strings that contain URL encodings like "%C3%A7", and I need to convert these sequences to the corresponding printable characters. Therefore I wrote a function. It works, but it seems rather awkward. I am an absolute C/C++ beginner. Perhaps someone can point me to a more elegant solution, please.

#include <iostream>
#include <string>       // std::string
#include <type_traits>  // std::extent

using namespace std;

static inline void substitute_specials(string &str) {
    const struct {string from,to;} substitutions[] { { "20"," " },{ "24","$" },{ "40","@" },{ "26","&" },{ "2C","," },{ "C3%A1","á" },{ "C3%A7","ç" },{ "C3%A9","é" } };
    size_t start_pos = 0;
    while ((start_pos = str.find("%", start_pos)) != string::npos) {
        start_pos++;
        for (int i=0; i< extent < decltype(substitutions) > ::value; i++) {
            if (str.compare(start_pos,substitutions[i].from.length(),substitutions[i].from)  == 0) {
                    str.replace(start_pos-1, substitutions[i].from.length()+1, substitutions[i].to);
                    start_pos += substitutions[i].to.length()-1;
                break; 
            }
        }
    }
}

int main() {
    string testString = "This%20is %C3%A1 test %24tring %C5ith %40 lot of spe%C3%A7ial%20charact%C3%A9rs%2C %26 worth many %24%24%24";
    substitute_specials(testString);
    cout << testString << "\n";
    return 0;
}

EDIT 26.12.2016: I am still stuck with this problem. I found some suggestions for libraries and some hand-written functions, but if they run at all, they only decode a single %xx escape (two hex digits in the string), like %20 = space. I haven't found any that handle a pair of escapes (four hex digits) like %C3%84 = Ä, and I wasn't able to modify any of them. The curl_easy_unescape() function from libcurl also seems to expect single two-digit codes.

What I need exists in JavaScript as the functions encodeURI() / decodeURI(), see http://www.w3schools.com/tags/ref_urlencode.asp. The C/C++ source of decodeURI() would probably solve my problem. Line 3829 in https://dxr.mozilla.org/mozilla-central/source/js/src/jsstr.cpp looks like an implementation of that, but I can't extract what I need.

From the other examples I have found: many use sscanf with the %x hex format to convert a two-digit hex code to an int, and then a static_cast on that int to retrieve the correct char. How can I modify that for the four-digit sequences? The current status of my function is:

wstring url_decode2(char* SRC) {

wstring ret;
wchar_t ch;
int i, ii;
char sub[5];

for (i=0; i<strlen(SRC); i++) {
    if (SRC[i]=='%') {
        if ((SRC[i+3]=='%') && (SRC[i+1]>='A')) {
            sub[0]=SRC[i+4]; 
            sub[1]=SRC[i+5]; // ( also tried lsb/msb )
            sub[2]=SRC[i+1]; // skip +3, it's %
            sub[3]=SRC[i+2]; // 
            sub[4]='\0';
            i=i+5;
        } else {
            sub[0]=SRC[i+1];
            sub[1]=SRC[i+2];
            sub[2]='\0';
            i=i+2;
        }
        sscanf(&sub[0], "%x", &ii);
        ch=static_cast<wchar_t>(ii);
        ret+=ch;
    } else 
        ret+=SRC[i];

}
return ret;

}

Can anyone help me, please?

jamacoe
  • These are *not* UTF8 "encodings". They are URL(?) escape sequences. What you see in this page are UTF8 characters. In UTF8 ASCII characters appear the same, non-ASCII characters use 2 or more bytes to store but are displayed as one character. You need a URL decoding method. – Panagiotis Kanavos Dec 13 '16 at 16:44
  • BTW [UTF8 literals](http://en.cppreference.com/w/cpp/language/string_literal) need the `u8` prefix, e.g. `u8"Δx = %"`. Or directly as a string: `auto testString=u8"Δx = %"s;` or `string testString=u8"Δx = %"s;` – Panagiotis Kanavos Dec 13 '16 at 16:47
  • The [MSDN page on String and Character Literals](https://msdn.microsoft.com/en-us/library/69ze775t.aspx) explains how to use UTF8, UTF16 etc in C++ in a very nice way. – Panagiotis Kanavos Dec 13 '16 at 16:52
  • What are you trying to do? Decode 8 sequences out of potential infinity? – n. m. could be an AI Dec 13 '16 at 16:59
  • @n.m the string is URL encoded, e.g. `%20` is the space character. All the OP needs to do is find a URL decoding library, or write the code by hand. I couldn't find *one* good duplicate question - there are a lot of answers that either use a hand-written method or a library like libcurl's [curl_unescape](https://curl.haxx.se/libcurl/c/curl_unescape.html) – Panagiotis Kanavos Dec 13 '16 at 17:07
  • @PanagiotisKanavos I have my own ideas about what OP needs but I want to hear from the OP. Thanks. – n. m. could be an AI Dec 13 '16 at 17:09
  • @n.m what *else* could these characters be? They aren't Unicode, that's certain – Panagiotis Kanavos Dec 13 '16 at 17:10
  • @PanagiotisKanavos I'm trying to ask OP, thank you. – n. m. could be an AI Dec 13 '16 at 17:11
  • Unescaped, the data should be `This is á test $tring %C5ith @ lot of speçial charactérs, & worth many $$$` – Panagiotis Kanavos Dec 13 '16 at 17:12
  • Possible duplicate of [Encode/Decode URLs in C++](http://stackoverflow.com/questions/154536/encode-decode-urls-in-c) – Panagiotis Kanavos Dec 13 '16 at 17:16
  • I found the sequences I want to translate here in the column 'UTF-8'. http://www.utf8-chartable.de/ That's why I thought they are 'UTF-8'. Whatever they are called, I want to translate them to printable characters. My function works fine and the 'unescaped' string you stated is correct. My question was, if I can do this any better? – jamacoe Dec 13 '16 at 17:45
  • @n.m. I'd rather not only decode 8 sequences, but all, if there is a function or whatever way to do it. Otherwise I'd have to list each one that may occur. – jamacoe Dec 13 '16 at 17:47
  • URL encoding has nothing to do with UTF-8. It encodes bytes. – n. m. could be an AI Dec 13 '16 at 20:41
  • It just so happens that bytes of your byte sequences form UTF-8 strings. – n. m. could be an AI Dec 13 '16 at 20:41
  • You need to read https://en.wikipedia.org/wiki/Percent-encoding and implement an algorithm that maps three characters "%UV" to one byte 0xUV, where U and V are any hexadecimal digits. – n. m. could be an AI Dec 13 '16 at 20:49
  • @n.m. Well, I didn't write "URL", that was edited in by someone. But as it turns out, I may be able to treat my strings like URLs and use some of the code suggested here: https://stackoverflow.com/questions/154536/encode-decode-urls-in-c – jamacoe Dec 13 '16 at 22:43
  • @jamacoe the edit was made because these are **NOT** Unicode encodings. They are URL escape sequences. You were asking and looking for the wrong thing, which often leads to downvoting – Panagiotis Kanavos Dec 14 '16 at 09:16
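
The mapping described in the comments above (three characters "%UV" become one byte 0xUV) can be sketched in a few lines of C++. This is a minimal sketch, not a vetted library routine: the names `percent_decode` and `hex_value` are made up for illustration, malformed escapes are copied through unchanged, and `+` is not treated as a space. Note that it decodes every valid escape, so `%C5` in the question's test string would become the lone byte 0xC5 rather than staying as text.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: numeric value of one hex digit.
// The caller must already have checked the character with std::isxdigit.
static int hex_value(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    return std::tolower(c) - 'a' + 10;
}

// Hypothetical name: turns every "%UV" (U, V hex digits) into the byte 0xUV.
// Works on bytes only; it neither knows nor cares about UTF-8.
std::string percent_decode(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool is_escape =
            in[i] == '%' && i + 2 < in.size() &&
            std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(in[i + 2]));
        if (is_escape) {
            const int hi = hex_value(static_cast<unsigned char>(in[i + 1]));
            const int lo = hex_value(static_cast<unsigned char>(in[i + 2]));
            out += static_cast<char>(hi * 16 + lo);
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += in[i];  // ordinary character, or a '%' that is not an escape
        }
    }
    return out;
}
```

With this, `percent_decode("spe%C3%A7ial%20charact%C3%A9rs%2C")` yields the UTF-8 bytes of `speçial charactérs,`. No special case is needed for the "four-digit" sequences: `%C3%A7` is simply two escapes in a row, and the resulting bytes 0xC3 0xA7 are the UTF-8 form of ç, which a UTF-8 terminal displays as one character.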

1 Answer

The answer to my own question is this URI unescape/decode routine, which also handles 2- and 3-byte sequences: https://stackoverflow.com/a/41434414/4335480
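
For readers who want to see what "2 and 3 byte sequences" means in code, here is a hedged sketch of the second half of the job: turning raw UTF-8 bytes into Unicode code points. It is not the linked routine. The name `utf8_to_code_points` is hypothetical, the input is assumed to be bytes whose percent escapes were already decoded, and the validation is simplified (malformed bytes become U+FFFD, but overlong forms and surrogates are not rejected).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Simplified validation: bad lead or continuation bytes yield U+FFFD.
std::u32string utf8_to_code_points(const std::string& bytes) {
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;        // code point being assembled
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else { out += replacement; ++i; continue; }  // stray continuation byte

        bool ok = i + extra < bytes.size();  // false if the sequence is truncated
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        if (!ok) { out += replacement; ++i; continue; }  // resync one byte later
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

For the bytes 0xC3 0xA7 this gives the single code point U+00E7 (ç): the lead byte contributes the bits 00011 and the continuation byte the bits 100111. A lone 0xC5 followed by `ith`, as in the question's test string, comes out as U+FFFD followed by `i`, `t`, `h`.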

jamacoe