44

I have a very simple program that outputs a simple JSON string, which I concatenate manually and write to the std::cout stream (the output really is that simple). However, some of my strings may contain double quotes, curly braces and other characters that could break the JSON. So I need a library (or, more accurately, a function) that escapes strings according to the JSON standard, as lightweight as possible, nothing more, nothing less.

I found a few libraries that encode whole objects into JSON, but since my program is a single 900-line .cpp file, I would rather not rely on a library several times bigger than my program just to achieve something this simple.

Lightness Races in Orbit
ddinchev
  • For people wanting to get an escaped version of JSON which they can send simply as string in c. Note that this is not a library but an online tool. http://tomeko.net/online_tools/cpp_text_escape.php?lang=en – Sahil Singh Apr 09 '17 at 10:00

4 Answers

63

Caveat

Whichever solution you choose, keep in mind that the JSON standard requires you to escape all control characters. Many developers get this wrong and escape only a handful of them.

All control characters means everything from '\x00' to '\x1f', not just those with a short representation such as '\x0a' (also known as '\n'). For example, you must escape the '\x02' character as \u0002.

See also: ECMA-404 - The JSON data interchange syntax, 2nd edition, December 2017, Page 4

Simple solution

If you know for sure that your input string is UTF-8 encoded, you can keep things simple.

Since JSON allows you to escape everything via \uXXXX, even " and \, a simple solution is:

#include <sstream>
#include <iomanip>
#include <string>

std::string escape_json(const std::string &s) {
    std::ostringstream o;
    for (auto c = s.cbegin(); c != s.cend(); c++) {
        if (*c == '"' || *c == '\\' || ('\x00' <= *c && *c <= '\x1f')) {
            o << "\\u"
              << std::hex << std::setw(4) << std::setfill('0') << static_cast<int>(*c);
        } else {
            o << *c;
        }
    }
    return o.str();
}

Shortest representation

For the shortest representation you may use JSON shortcuts, such as \" instead of \u0022. The following function produces the shortest JSON representation of a UTF-8 encoded string s:

#include <sstream>
#include <iomanip>
#include <string>

std::string escape_json(const std::string &s) {
    std::ostringstream o;
    for (auto c = s.cbegin(); c != s.cend(); c++) {
        switch (*c) {
        case '"': o << "\\\""; break;
        case '\\': o << "\\\\"; break;
        case '\b': o << "\\b"; break;
        case '\f': o << "\\f"; break;
        case '\n': o << "\\n"; break;
        case '\r': o << "\\r"; break;
        case '\t': o << "\\t"; break;
        default:
            if ('\x00' <= *c && *c <= '\x1f') {
                o << "\\u"
                  << std::hex << std::setw(4) << std::setfill('0') << static_cast<int>(*c);
            } else {
                o << *c;
            }
        }
    }
    return o.str();
}
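The range check above compares plain char values, and whether char is signed depends on the platform. As a minimal sketch of an alternative (the name escape_json_unsigned is hypothetical and not part of the original answer), each byte can be cast to unsigned char once, so only the upper bound needs checking. This version also appends to a std::string instead of using a stream, and it still assumes the input is well-formed UTF-8:

```cpp
#include <cstdio>
#include <string>

// Hypothetical variant of escape_json; a sketch, not the answer author's code.
std::string escape_json_unsigned(const std::string &s) {
    std::string out;
    out.reserve(s.size());
    for (char ch : s) {
        // Cast once so that bytes >= 0x80 are never treated as negative values.
        const unsigned char c = static_cast<unsigned char>(ch);
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\b': out += "\\b"; break;
        case '\f': out += "\\f"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        default:
            if (c <= 0x1f) {
                // Remaining control characters get the generic \u00XX form.
                char buf[7];
                std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
                out += buf;
            } else {
                // Everything else, including UTF-8 multi-byte sequences, is copied as-is.
                out += ch;
            }
        }
    }
    return out;
}
```

Calling escape_json_unsigned on a string holding the single byte '\x02' should yield the six characters \u0002, while bytes of multi-byte UTF-8 sequences pass through unchanged.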

Pure switch statement

It is also possible to get by with a pure switch statement, that is, without if and <iomanip>. While this is quite cumbersome, it may be preferable from a "security by simplicity and purity" point of view:

#include <sstream>
#include <string>

std::string escape_json(const std::string &s) {
    std::ostringstream o;
    for (auto c = s.cbegin(); c != s.cend(); c++) {
        switch (*c) {
        case '\x00': o << "\\u0000"; break;
        case '\x01': o << "\\u0001"; break;
        ...
        case '\x0a': o << "\\n"; break;
        ...
        case '\x1f': o << "\\u001f"; break;
        case '\x22': o << "\\\""; break;
        case '\x5c': o << "\\\\"; break;
        default: o << *c;
        }
    }
    return o.str();
}

Using a library

You might want to have a look at https://github.com/nlohmann/json, which is an efficient header-only C++ library (MIT License) that seems to be very well-tested.

You can either call their escape_string() method directly (Note that this is a bit tricky, see comment below by Lukas Salich), or you can take their implementation of escape_string() as a starting point for your own implementation:

https://github.com/nlohmann/json/blob/ec7a1d834773f9fee90d8ae908a0c9933c5646fc/src/json.hpp#L4604-L4697

malat
vog
  • 1
    "You can either call their escape_string() method directly" => I tried that, but it's private so not, you can't use that library. – Lukas Salich Jan 15 '21 at 13:03
  • @LukasSalich Thanks for pointing this out. I updated my answer accordingly. – vog Jan 17 '21 at 21:20
  • But it's called in the nlohmann::json constructor or when calling .dump method. So you can escape the string by calling the constructor and passing to stream or passing the .dump result. – Lukas Salich Jan 18 '21 at 08:16
  • current link to function `dump_escaped` in nlohmann/json.hpp: [here](https://github.com/nlohmann/json/blob/6f551930e5c7ef397056de121c0da82f77573cca/include/nlohmann/detail/output/serializer.hpp#L380-L630) (still private). another stable implementation: [chromium json/string_escape.cc](https://chromium.googlesource.com/chromium/src/base/+/master/json/string_escape.cc) – milahu Apr 22 '21 at 19:47
  • `'\x00' <= *c` is always true – milahu Apr 22 '21 at 21:37
  • At least the second would be better done using the preprocessor: `#define X(a, b) case a: o << b; break;` `switch (*c) { X(0, "\\u0000") X(1, "\\u0001") ... X('\n', "\\n") ...` – Deduplicator Apr 22 '21 at 22:29
  • @Deduplicator I'm not sure if one should really do that. The whole purpose of that variant is obvious correctness, deliberately at the cost of having more redundant code. On the other hand, a slightly different macro with just one parameter which e.g. expands `Y(01)` to `case '\x01': o << "\\u0001"; break;` might protect against certain types of typos, and hence be worth the introduction of that indirection. – vog Apr 23 '21 at 08:30
  • 1
    @MilaNautikus On many systems, `char` defaults to `signed char`, which is why the check `'\x00' <= *c` is necessary. See also: https://stackoverflow.com/q/3728045/19163 – vog Apr 23 '21 at 08:43
  • @vog I think using the macro and thus reducing redundancy makes errors more obvious. Still, if you can get that variant done easily, that would be even better. Beware octal numbers though. Also, isn't `\x7f` a control-character too? – Deduplicator Apr 23 '21 at 11:37
  • @Deduplicator From the perspective of the JSON standard, `\x7f` does not need to be escaped. Relevant quote: "the code points that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F". See: https://www.ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf#page=12 – vog Apr 23 '21 at 11:44
  • @vog Funny how they have a subtly different definition of "control characters". Also a shame they don't include `\0` for NUL, and `\xAA` escape-sequences. – Deduplicator Apr 23 '21 at 13:09
  • @vog "On many systems, `char` defaults to `signed char`" - then it should be faster to cast to `unsigned char` or `std::uint8_t` – milahu Apr 29 '21 at 15:31
  • RFC7159 Section 7 "To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E"." - what about this? – ivan.ukr May 03 '21 at 17:32
  • 1
    @ivan.ukr To my understanding, this section is irrelevant to ensure a secure JSON encoding. It is only important if you want your JSON to contain only (7-bit) ASCII characters, for whatever reason. If you don't have that constraint, you can encode every special character, including U+1D11E, just directly as UTF-8 and be done with it. However, it is important to repeat that the above code assumes "you know for sure that your input string is UTF-8 encoded". If your input string might contain ill-formed UTF-8 sequences, you can't take that approach. – vog May 05 '21 at 12:13
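For the case discussed in the last two comments, where the output must be pure 7-bit ASCII, the surrogate-pair rule from RFC 7159 can be sketched as follows. The names append_u16 and escape_json_ascii are hypothetical, the code assumes well-formed UTF-8 input and does not detect ill-formed sequences, and for brevity it writes every control character as \u00XX rather than using the short forms:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: appends \uXXXX for one UTF-16 code unit.
static void append_u16(std::string &out, std::uint32_t unit) {
    char buf[7];
    std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(unit & 0xffffu));
    out += buf;
}

// Sketch: JSON escaping with ASCII-only output.
// Assumption: s is well-formed UTF-8; ill-formed input is NOT detected.
std::string escape_json_ascii(const std::string &s) {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) {
            if (b == '"' || b == '\\') { out += '\\'; out += s[i]; }
            else if (b <= 0x1f) append_u16(out, b);
            else out += s[i];
            ++i;
            continue;
        }
        // Decode one multi-byte sequence into a code point.
        const int extra = (b >= 0xf0) ? 3 : (b >= 0xe0) ? 2 : 1;
        std::uint32_t cp = b & (0x3f >> extra);
        for (int k = 1; k <= extra && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3f);
        i += 1 + static_cast<std::size_t>(extra);
        if (cp >= 0x10000) {
            // Outside the Basic Multilingual Plane: emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            append_u16(out, 0xd800 + (cp >> 10));
            append_u16(out, 0xdc00 + (cp & 0x3ff));
        } else {
            append_u16(out, cp);
        }
    }
    return out;
}
```

With this sketch, the UTF-8 bytes F0 9D 84 9E (the G clef, U+1D11E) come out as \ud834\udd1e, which matches the RFC example apart from the letter case of the hex digits.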
6

I have written simple JSON escape and unescape functions. The code is publicly available on GitHub. For anyone interested, here it is:

enum State {ESCAPED, UNESCAPED};

std::string escapeJSON(const std::string& input)
{
    std::string output;
    output.reserve(input.length());

    for (std::string::size_type i = 0; i < input.length(); ++i)
    {
        switch (input[i]) {
            case '"':
                output += "\\\"";
                break;
            case '/':
                output += "\\/";
                break;
            case '\b':
                output += "\\b";
                break;
            case '\f':
                output += "\\f";
                break;
            case '\n':
                output += "\\n";
                break;
            case '\r':
                output += "\\r";
                break;
            case '\t':
                output += "\\t";
                break;
            case '\\':
                output += "\\\\";
                break;
            default:
                output += input[i];
                break;
        }

    }

    return output;
}

std::string unescapeJSON(const std::string& input)
{
    State s = UNESCAPED;
    std::string output;
    output.reserve(input.length());

    for (std::string::size_type i = 0; i < input.length(); ++i)
    {
        switch(s)
        {
            case ESCAPED:
                {
                    switch(input[i])
                    {
                        case '"':
                            output += '\"';
                            break;
                        case '/':
                            output += '/';
                            break;
                        case 'b':
                            output += '\b';
                            break;
                        case 'f':
                            output += '\f';
                            break;
                        case 'n':
                            output += '\n';
                            break;
                        case 'r':
                            output += '\r';
                            break;
                        case 't':
                            output += '\t';
                            break;
                        case '\\':
                            output += '\\';
                            break;
                        default:
                            output += input[i];
                            break;
                    }

                    s = UNESCAPED;
                    break;
                }
            case UNESCAPED:
                {
                    switch(input[i])
                    {
                        case '\\':
                            s = ESCAPED;
                            break;
                        default:
                            output += input[i];
                            break;
                    }
                }
        }
    }
    return output;
}
mariolpantunes
  • 1
    I know enums are pretty, but wouldn't a boolean have sufficed? – Phillip Elm Jan 06 '14 at 09:27
  • 1
    I know this is an old answer, but I guess the cleanest approach is to take the input by value, modify it directly and return it by move semantics (which is automatic in return statements), this avoid useless copies. – markand May 26 '15 at 09:11
  • 3
    Do not use this code in production! This JSON escaping misses important special characters. See: http://stackoverflow.com/a/33799784 – vog Nov 19 '15 at 10:24
  • 2
    @vog, complete or not, the nice thing about this answer is that it includes a unescape function, while your answer does not. – Timothy Miller Jan 24 '17 at 02:36
  • 2
    @TimothyMiller How is this relevant here? The question was only about escaping. Moreover, the unescape function doesn't handle all cases, either (e.g. no `\uXXXX`). So my warning still holds: Don't use any of these functions in production code! – vog Feb 14 '17 at 19:06
  • 1
    > complete or not, the nice thing about this answer is that it includes a unescape function, But IT IS WRONG. A wrong algorithm that does more, but is wrong, is no use. – Tom Swirly Jun 06 '20 at 09:29
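The comments point out that the unescape function above ignores \uXXXX sequences. A minimal sketch of an unescape routine that handles them, including surrogate pairs, might look like this. All names here (parse_hex4, append_utf8, unescape_json) are hypothetical, the input is assumed to be the contents of a JSON string without the surrounding quotes, and lone surrogates are encoded without validation:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: parses the 4 hex digits at s[pos] .. s[pos + 3].
static std::uint32_t parse_hex4(const std::string &s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::invalid_argument("truncated \\u escape");
    std::uint32_t v = 0;
    for (std::size_t k = 0; k < 4; ++k) {
        const char c = s[pos + k];
        v <<= 4;
        if (c >= '0' && c <= '9') v |= static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') v |= static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') v |= static_cast<std::uint32_t>(c - 'A' + 10);
        else throw std::invalid_argument("bad hex digit in \\u escape");
    }
    return v;
}

// Hypothetical helper: appends a code point as UTF-8 (no validation).
static void append_utf8(std::string &out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xc0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xe0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else {
        out += static_cast<char>(0xf0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    }
}

std::string unescape_json(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }
        if (++i == in.size()) throw std::invalid_argument("dangling backslash");
        switch (in[i]) {
        case '"': case '\\': case '/': out += in[i]; break;
        case 'b': out += '\b'; break;
        case 'f': out += '\f'; break;
        case 'n': out += '\n'; break;
        case 'r': out += '\r'; break;
        case 't': out += '\t'; break;
        case 'u': {
            std::uint32_t cp = parse_hex4(in, i + 1);
            i += 4;  // i now points at the last hex digit
            // A high surrogate followed by \uDC00..\uDFFF forms one code point.
            if (cp >= 0xd800 && cp <= 0xdbff && i + 2 < in.size()
                && in[i + 1] == '\\' && in[i + 2] == 'u') {
                const std::uint32_t lo = parse_hex4(in, i + 3);
                if (lo >= 0xdc00 && lo <= 0xdfff) {
                    cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
                    i += 6;
                }
            }
            append_utf8(out, cp);
            break;
        }
        default: throw std::invalid_argument("unknown escape sequence");
        }
    }
    return out;
}
```

Unlike the function above, this sketch throws std::invalid_argument on an unknown escape or a dangling backslash instead of silently copying the character; whether that is the right policy depends on the caller.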
2

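To build on vog's answer: the script below generates a full replacement table for the characters 0 to 92 (NUL to backslash), together with a C++ function that looks each character up in that table.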

// generate full jump table for c++ json string escape
// license is public domain or CC0-1.0
//var s = require('fs').readFileSync('case-list.txt', 'utf8');
var s = ` // escape hell...
        case '"': o << "\\\\\\""; break;
        case '\\\\': o << "\\\\\\\\"; break;
        case '\\b': o << "\\\\b"; break;
        case '\\f': o << "\\\\f"; break;
        case '\\n': o << "\\\\n"; break;
        case '\\r': o << "\\\\r"; break;
        case '\\t': o << "\\\\t"; break;
`;
const charMap = new Map();
s.replace(/case\s+'(.*?)':\s+o\s+<<\s+"(.*?)";\s+break;/g, (...args) => {
  const [, charEsc, replaceEsc ] = args;
  const char = eval(`'${charEsc}'`);
  const replace = eval(`'${replaceEsc}'`);
  //console.dir({ char, replace, });
  charMap.set(char, replace);
});
const iMax = Math.max(
  0x1f, // 31. 0 to 31: control characters
  '"'.charCodeAt(0), // 34
  '\\'.charCodeAt(0), // 92
);
const replace_function_name = 'String_showAsJson';
const replace_array_name = replace_function_name + '_replace_array';
// longest replace (\u0000) has 6 chars + 1 null byte = 7 byte
var res = `\
// ${iMax + 1} * 7 = ${(iMax + 1) * 7} byte / 4096 page = ${Math.round((iMax + 1) * 7 / 4096 * 100)}%
char ${replace_array_name}[${iMax + 1}][7] = {`;
res += '\n  ';
let i, lastEven;
for (i = 0; i <= iMax; i++) {
  const char = String.fromCharCode(i);
  const replace = charMap.has(char) ? charMap.get(char) :
    (i <= 0x1f) ? '\\u' + i.toString(16).padStart(4, 0) :
    char // no replace
  ;
  const hex = '0x' + i.toString(16).padStart(2, 0);
  //res += `case ${hex}: o << ${JSON.stringify(replace)}; break; /`+`/ ${i}\n`;
  //if (i > 0) res += ',';
  //res += `\n  ${JSON.stringify(replace)}, // ${i}`;
  if (i > 0 && i % 5 == 0) {
    res += `// ${i - 5} - ${i - 1}\n  `;
    lastEven = i;
  }
  res += `${JSON.stringify(replace)}, `;
}
res += `// ${lastEven} - ${i - 1}`;
res += `\n};

void ${replace_function_name}(std::ostream & o, const std::string & s) {
  for (auto c = s.cbegin(); c != s.cend(); c++) {
    if ((std::uint8_t) *c <= ${iMax})
      o << ${replace_array_name}[(std::uint8_t) *c];
    else
      o << *c;
  }
}
`;

//console.log(res);
document.querySelector('#res').innerHTML = res;
<pre id="res"></pre>
milahu
1

You didn't say exactly where those strings you're cobbling together are coming from, originally, so this may not be of any use. But if they all happen to live in the code, as @isnullxbh mentioned in this comment to an answer on a different question, another option is to leverage a lovely C++11 feature: Raw string literals.

I won't quote cppreference's long-winded, standards-based explanation, you can read it yourself there. Basically, though, R-strings bring to C++ the same sort of programmer-delimited literals, with absolutely no restrictions on content, that you get from here-docs in the shell, and which languages like Perl use so effectively. (Prefixed quoting using curly braces may be Perl's single greatest invention:)

my $qstring = q{Quoted 'string'!};
my $qqstring = qq{Double "quoted" 'string'!};
my $replacedstring = q{Regexps that /totally/! get eaten by your parser.};
$replacedstring =~ s{/totally/!}{(won't!)};
# Heh. I see the syntax highlighter isn't quite up to the challenge, though.

In C++11 or later, a raw string literal is prefixed with a capital R before the opening double quote. Inside the quotes, the text is preceded by an optional delimiter of your choice (up to 16 characters, excluding spaces, parentheses and backslashes) followed by an opening parenthesis.

From there on, you can safely write literally anything other than a closing parenthesis followed by your chosen delimiter and a double quote. That sequence terminates the raw literal, and you then have a string literal, which you can store in a std::string and trust to contain exactly the characters you typed, with no escape processing applied.
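As a small illustration of the delimiter rule (the variable names and the delimiter x are arbitrary choices for this sketch), the two strings below hold exactly the same characters:

```cpp
#include <string>

// Raw literal: the delimiter "x" is arbitrary and may also be empty, as in R"(...)".
// Quotes, parentheses and backslashes inside need no escaping.
const std::string quoted = R"x(He said "hi" (twice) and typed C:\temp)x";

// The same text as an ordinary literal needs an escape for each quote and backslash.
const std::string escaped = "He said \"hi\" (twice) and typed C:\\temp";
```

Note that this only changes how the text is written in the source file. The resulting string still contains the quotes and the backslash verbatim, so it is not JSON-escaped in any way.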

"Raw"-ness is not lost in subsequent manipulations, either. So, borrowing from the chapter list for Crockford's How JavaScript Works, this is completely valid:

std::string ch0_to_4 = R"json(
[
    {"number": 0, "chapter": "Read Me First!"},
    {"number": 1, "chapter": "How Names Work"},
    {"number": 2, "chapter": "How Numbers Work"},
    {"number": 3, "chapter": "How Big Integers Work"},
    {"number": 4, "chapter": "How Big Floating Point Works"},)json";

std::string ch5_and_6 = R"json(
    {"number": 5, "chapter": "How Big Rationals Work"},
    {"number": 6, "chapter": "How Booleans Work"})json";

std::string chapters = ch0_to_4 + ch5_and_6 + "\n]";
std::cout << chapters;

The string 'chapters' will emerge from std::cout completely intact (preceded by one blank line, because the first raw literal begins with a newline):

[
    {"number": 0, "chapter": "Read Me First!"},
    {"number": 1, "chapter": "How Names Work"},
    {"number": 2, "chapter": "How Numbers Work"},
    {"number": 3, "chapter": "How Big Integers Work"},
    {"number": 4, "chapter": "How Big Floating Point Works"},
    {"number": 5, "chapter": "How Big Rationals Work"},
    {"number": 6, "chapter": "How Booleans Work"}
]
FeRD