I have a string which contains (not is) JSON-encoded data, like in this example:

foo([1, 2, 3], "some more stuff")
    |        |
  start     end   (of JSON-encoded data)

The complete language we use in our application nests JSON-encoded data, while the rest of the language is trivial (just recursive stuff). When parsing strings like this from left to right in a recursive parser, I know when I encounter a JSON-encoded value, like here the [1, 2, 3] starting at index 4. After parsing this substring, I need to know the end position to continue parsing the rest of the string.

I'd like to pass this substring to a well-tested JSON parser such as QJsonDocument in Qt5. But according to the documentation, there is no way to parse only a leading substring as JSON, i.e. to have control return as soon as the JSON value ends (after consuming the ] here) without reporting a parse error. I also need to know the end position to continue parsing my own stuff (here the remaining string is , "some more stuff")).

To do this, I used to use a custom JSON parser which takes the current position by reference and updates it after finishing parsing. But since it's a security-critical part of a business application, we don't want to stick to my self-crafted parser anymore. I mean there is QJsonDocument, so why not use it. (We already use Qt5.)

As a work-around, I'm thinking of this approach:

  • Let QJsonDocument parse the substring starting from the current position (which is not valid JSON, because of the trailing text)
  • The error reports an unexpected character at some position beyond the end of the JSON
  • Let QJsonDocument parse again, but this time only the substring up to the correct end position

A second idea is to write a "JSON end scanner" which takes the whole string, a start position and returns the end position of the JSON-encoded data. This also requires parsing, as unmatched brackets / parentheses can appear in string values, but it should be much easier (and safer) to write (and use) such a class in comparison to a fully hand-crafted JSON-parser.
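
A minimal sketch of such an end scanner, assuming the value at the start position is an array, an object or a string (bare numbers and literals such as true are not handled here). The name findJsonEnd is made up for this illustration and is not part of Qt. It only tracks structure and does not validate anything, so the substring it delimits would still have to go through a real JSON parser:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt or Boost API): returns the index one past the
// end of the JSON value that starts exactly at `start`, or std::string::npos
// if no balanced array, object or string is found there.
std::size_t findJsonEnd(const std::string& s, std::size_t start)
{
    int depth = 0;          // current nesting level of [ ] and { }
    bool inString = false;  // true while between two unescaped double quotes

    for (std::size_t i = start; i < s.size(); ++i) {
        const char c = s[i];
        if (inString) {
            if (c == '\\') {
                ++i;  // skip the escaped character, so \" does not end the string
            } else if (c == '"') {
                inString = false;
                if (depth == 0) return i + 1;  // a top-level string ends here
            }
        } else if (c == '"') {
            inString = true;
        } else if (c == '[' || c == '{') {
            ++depth;  // bracket kinds are deliberately not distinguished
        } else if (c == ']' || c == '}') {
            if (depth == 0) return std::string::npos;  // close without open
            if (--depth == 0) return i + 1;            // outermost bracket closed
        } else if (depth == 0) {
            return std::string::npos;  // bare scalars are out of scope for this sketch
        }
    }
    return std::string::npos;  // input ended before the value was closed
}
```

For the example string above, findJsonEnd(input, 4) returns 13, so parsing of my own language would continue at , "some more stuff").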

Does anybody have a better idea?

leemes
  • Nice question! Don't know the answer though :( – StackExchange User Apr 13 '13 at 18:21
  • I think I'm just going to write a simple _structure_ parser based on http://www.ietf.org/rfc/rfc4627.txt – sehe Apr 13 '13 at 18:40
  • Note: The "JSON end scanner" only needs to report a meaningful value if there is valid JSON starting from the current position. If there is not, I still want to run the QJsonDocument parser over the string to get a good error message. So a scanner might just count opening and closing brackets, without even distinguishing between them. The only difficulty is to not count them within strings. Note that `\"` will not terminate a string. I guess this is enough information to write such a scanner. The question is whether there is a better solution (easier, faster, safer, ...). – leemes Apr 13 '13 at 18:45

1 Answer

I rolled a quick parser[*] based on http://www.ietf.org/rfc/rfc4627.txt using Spirit Qi.

It doesn't actually parse into an AST, but it parses all of the JSON payload, which is actually a bit more than required here.

The sample here (http://liveworkspace.org/code/3k4Yor$2) outputs:

Non-JSON part of input starts after valid JSON: ', "some more stuff")'

Based on the test given by the OP:

const std::string input("foo([1, 2, 3], \"some more stuff\")");

// set to start of JSON
auto f(begin(input)), l(end(input));
std::advance(f, 4);

bool ok = tryParseAsJson(f, l); // updates f to point after the end of valid JSON

if (ok) 
    std::cout << "Non-JSON part of input starts after valid JSON: '" << std::string(f, l) << "'\n";

I have tested with several other more involved JSON documents (including multiline).

A few remarks:

  • I made the parser Iterator-based so it will likely easily work with Qt strings(?)
  • If you want to disallow multi-line fragments, change the skipper from qi::space to qi::blank
  • There is a conformance shortcut regarding number parsing (see TODO) that doesn't affect validity for this answer (see comment).

[*] technically, this is more of a parser stub since it doesn't translate into something else. It is basically a lexer taking on too much work :)
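
Regarding the number shortcut: as far as I know, qi::double_ with its default policies also accepts inputs such as +1, .5 or NaN, which the RFC grammar rejects. A stricter matcher could look roughly like the sketch below. It uses only the standard library, matchJsonNumber is a made-up name for this illustration, and it is not part of the sample code:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index one past the longest prefix at `pos`
// that matches  number = [ minus ] int [ frac ] [ exp ]  from RFC 4627,
// or std::string::npos if no number starts at `pos`.
std::size_t matchJsonNumber(const std::string& s, std::size_t pos)
{
    const auto isDigit = [&s](std::size_t i) {
        return i < s.size() && s[i] >= '0' && s[i] <= '9';
    };

    std::size_t i = pos;
    if (i < s.size() && s[i] == '-') ++i;        // [ minus ]
    if (!isDigit(i)) return std::string::npos;   // int needs at least one digit
    if (s[i] == '0') {
        ++i;                                     // zero: no further leading digits
    } else {
        while (isDigit(i)) ++i;                  // digit1-9 *DIGIT
    }

    // [ frac ]: only consumed if at least one digit follows the decimal point
    if (i < s.size() && s[i] == '.' && isDigit(i + 1)) {
        ++i;
        while (isDigit(i)) ++i;
    }

    // [ exp ]: only consumed if at least one digit follows e / E and the sign
    if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
        std::size_t j = i + 1;
        if (j < s.size() && (s[j] == '+' || s[j] == '-')) ++j;
        if (isDigit(j)) {
            while (isDigit(j)) ++j;
            i = j;
        }
    }
    return i;
}
```

Optional parts that do not match completely are simply left unconsumed, so for 01 only the 0 is matched and the caller then trips over the stray 1.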


Full Code of sample:

// #define BOOST_SPIRIT_DEBUG
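#include <boost/spirit/include/qi.hpp>
#include <cstdint>   // uint32_t
#include <iostream>  // std::cout, std::cerr, std::cin
#include <iterator>  // std::istream_iterator, std::advance
#include <string>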

namespace qi = boost::spirit::qi;

template <typename It, typename Skipper = qi::space_type>
    struct parser : qi::grammar<It, Skipper>
{
    parser() : parser::base_type(json)
    {
        // 2.1 values
        value = qi::lit("false") | "null" | "true" | object | array | number | string;

        // 2.2 objects
        object = '{' >> -(member % ',') >> '}';
        member = string >> ':' >> value;

        // 2.3 Arrays
        array = '[' >> -(value % ',') >> ']';

        // 2.4.  Numbers
        // Note: our Spirit grammar takes a shortcut here, as the RFC specification
        // (quoted below) is more restrictive than qi::double_.
        //
        // However, none of the differences affect any structural characters (:,{}[] and
        // double quotes), so it doesn't matter for the current purpose. For full compliance,
        // this remains TODO:
        //
        //    Numeric values that cannot be represented as sequences of digits
        //    (such as Infinity and NaN) are not permitted.
        //     number = [ minus ] int [ frac ] [ exp ]
        //     decimal-point = %x2E       ; .
        //     digit1-9 = %x31-39         ; 1-9
        //     e = %x65 / %x45            ; e E
        //     exp = e [ minus / plus ] 1*DIGIT
        //     frac = decimal-point 1*DIGIT
        //     int = zero / ( digit1-9 *DIGIT )
        //     minus = %x2D               ; -
        //     plus = %x2B                ; +
        //     zero = %x30                ; 0
        number = qi::double_; // shortcut :)

        // 2.5 Strings
        string = qi::lexeme [ '"' >> *char_ >> '"' ];

        static const qi::uint_parser<uint32_t, 16, 4, 4> _4HEXDIG;

        char_ = ~qi::char_("\"\\") |
               qi::char_("\x5C") >> (       // \ (reverse solidus)
                   qi::char_("\x22") |      // "    quotation mark  U+0022
                   qi::char_("\x5C") |      // \    reverse solidus U+005C
                   qi::char_("\x2F") |      // /    solidus         U+002F
                   qi::char_("\x62") |      // b    backspace       U+0008
                   qi::char_("\x66") |      // f    form feed       U+000C
                   qi::char_("\x6E") |      // n    line feed       U+000A
                   qi::char_("\x72") |      // r    carriage return U+000D
                   qi::char_("\x74") |      // t    tab             U+0009
                   qi::char_("\x75") >> _4HEXDIG )  // uXXXX                U+XXXX
               ;

        // entry point
        json = value;

        BOOST_SPIRIT_DEBUG_NODES(
                (json)(value)(object)(member)(array)(number)(string)(char_));
    }

  private:
    qi::rule<It, Skipper> json, value, object, member, array, number, string;
    qi::rule<It> char_;
};

template <typename It>
bool tryParseAsJson(It& f, It l) // note: first iterator gets updated
{
    static const parser<It, qi::space_type> p;

    try
    {
        return qi::phrase_parse(f,l,p,qi::space);
    } catch(const qi::expectation_failure<It>& e)
    {
        // expectation points not currently used, but we could tidy up the grammar to bail on unexpected tokens
        std::string frag(e.first, e.last);
        std::cerr << e.what() << "'" << frag << "'\n";
        return false;
    }
}

int main()
{
#if 0
    // read full stdin
    std::cin.unsetf(std::ios::skipws);
    std::istream_iterator<char> it(std::cin), pte;
    const std::string input(it, pte);

    // set up parse iterators
    auto f(begin(input)), l(end(input));
#else
    const std::string input("foo([1, 2, 3], \"some more stuff\")");

    // set to start of JSON
    auto f(begin(input)), l(end(input));
    std::advance(f, 4);
#endif

    bool ok = tryParseAsJson(f, l); // updates f to point after the end of valid JSON

    if (ok) 
        std::cout << "Non-JSON part of input starts after valid JSON: '" << std::string(f, l) << "'\n";
    return ok? 0 : 255;
}
sehe
  • Thank you very much. Yes, QString has an iterator interface but qi seems to have problems with it... I currently try to solve it ;) Maybe it is because QString uses a custom character type, namely QChar, which might make qi unhappy. The iterator itself is `QChar*` so dealing with the iterator is not the problem. Substring construction doesn't work with `QString(f,l)` but with `QString(f, l-f)` (length parameter). (But your code worked without problems when using std::string. Thanks!) – leemes Apr 13 '13 at 19:45
  • @leemes If you care to throw me an isolated test app using Qt I might try to learn about Qt here :) - darn the [example](http://www.boost.org/doc/libs/1_48_0/libs/spirit/example/qi/custom_string.cpp) is for parsing _into_ QString – sehe Apr 13 '13 at 19:48
  • I found the problem, and sadly it's deep inside spirit: It tries to cast the character type to an int, which isn't supported by `QChar` :( It's in `boost/spirit/home/support/char_class.hpp`, line 785. – leemes Apr 13 '13 at 19:50
  • This looks more promising (posted to the spirit-general list in 2011): http://boost.2283326.n4.nabble.com/QString-and-QChar-support-in-qi-td3555736.html – sehe Apr 13 '13 at 19:51
  • Ah, didn't see your comment edit. Yes, this looks promising. Let me try this. – leemes Apr 13 '13 at 19:51
  • Out of curiosity I decided to make the JSON parser [UNICODE aware](https://raw.github.com/sehe/spirit-v2-json/master/testcases/test1.json) and parse to an actual AST tree ([a beauty if I may say so myself](https://github.com/sehe/spirit-v2-json/blob/master/JSON.hpp)). The roundtrip test checks out (although the ordering isn't stable, so the first test reports a false negative; use `list>` instead of `map` to prevent this). See [my github](https://github.com/sehe/spirit-v2-json) – sehe Apr 17 '13 at 01:21
  • And now I'm able to demo it _live_ online too: **[http://coliru..../view?id=079b418....](http://coliru.sehe.nl:8989/view?id=079b4187c3c0c23df896c918e70f2ee4-a2d2a787faa672a535d595284cedb612)** – sehe Apr 17 '13 at 22:20