Boost::Spirit parser: Looking for max performance and min mem usage

Question

After several questions about parsing fairly complex logs, I have finally been told the best method for doing so.

The question now is whether there is some way to improve performance and/or reduce memory usage, and even compilation times. I would ask that answers fulfill these constraints:

MS VS 2010 (not fully C++11, just some features implemented: auto, lambdas...) and Boost 1.53 (the only issue here is that string_view was not yet available, but string_ref is still a valid choice, even though it will probably be deprecated in the future).
The logs are zipped and are unzipped directly into RAM using an open library that outputs a plain C "char" array, so it is not worth using a std::string because the memory is already allocated by the library. There are thousands of them and they fill several GB, so keeping them in memory is not an option. What I mean is that using string_view is not possible because the logs are deleted after parsing.
It might be a good idea to parse the date string as a POSIX time. Only two remarks: it would be good to avoid allocating a string for this, and as far as I know POSIX times do not hold ms, so those would have to be kept in an extra variable.
Some strings are repeated throughout the logs (the road variable, for example). It might be interesting to use the flyweight pattern (its Boost implementation) to reduce memory, keeping in mind that this will have a performance cost.
Compilation times are a pain when working with template libraries. I would really appreciate any tweak that helps to reduce them: maybe splitting the grammar into sub-grammars? Maybe using pre-compiled headers?
The final use of this is to make queries about any event, e.g. get all the GEAR events (values and times), and also to have records of all the car variables within a fixed interval or every time an event occurs. There are two types of records in the logs: pure "Location" records and "Location + Event" records (I mean, every time an event is parsed the location must also be saved). Having them separated into two vectors allows fast queries but slows down parsing. Using only a common vector allows fast parsing but slows down queries. Any ideas about this? Maybe Boost Multi-Index would help, as was suggested before?

Please do not hesitate to provide any solution or change anything that you think may help to achieve the goal.

//#define BOOST_SPIRIT_DEBUG
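#include <boost/fusion/adapted/struct.hpp>
#include <boost/fusion/include/io.hpp> // operator<< for adapted structs
#include <boost/spirit/include/qi.hpp>
#include <cstring>  // strlen
#include <iostream> // std::cout
#include <string>
#include <vector>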

typedef char const* It;

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    struct LogRecord {
        int driver;        
        double time;
        double vel;
        double km;
        std::string date;
        std::string road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
    (MyEvents::Kind, kind)
    (double, value))


BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (std::string, date)
        (double, time)
        (int, driver)
        (double, vel)
        (std::string, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser() : LogParser::base_type(start) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            line_record
                = '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >> 
                             repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
                >> " - " >> double_ >> " s"
                >> " => Driver: "  >> int_
                >> " - Speed: "    >> double_
                >> " - Road: "     >> raw[+graph]
                >> " - Km: "       >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            //BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
        }

      private:
        qi::rule<It, MyEvents::LogRecords()> start;

        qi::rule<It, MyEvents::LogRecord()> line_record;

        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

MyEvents::LogRecords parse_spirit(It b, It e) {
    static QiParsers::LogParser<It> const parser;

    MyEvents::LogRecords records;
    parse(b, e, parser, records);

    return records;
}

static char input[] = 
"[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
[2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
static const size_t len = strlen(input);

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main() {
    MyEvents::LogRecords records = parse_spirit(input, input+len);
    std::cout << "Parsed: " << records.size() << " records\n";

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 

    return 0;
}

sehe · Accepted Answer · 2018-03-31T15:39:31.607

Yes, string_ref is essentially the same as std::string_view, but with a slightly different interface at some points.

Revision 1: POSIX time

Storing the POSIX time turns out to be really simple:

#include <boost/date_time/posix_time/posix_time_io.hpp>

Next, replace the type:

typedef boost::posix_time::ptime Timestamp;

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;    // << HERE using Timestamp now
    std::string road;
    Event event;
};

And simplify the parser to just:

'[' >> stream >> ']'

Prints Live On Coliru

Parsed: 9 records
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
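
If benchmarking shows the date parsing to be a hot spot, or if Boost.DateTime is unwanted, the timestamp could also be decoded by hand without any allocation. The following is only a sketch under stated assumptions: CompactStamp, parse_stamp and the helper names are made up for illustration, the input is assumed to always have the fixed-width layout of the sample logs with English month abbreviations, the time is assumed to be UTC, and calendar ranges (such as day 31 in February) are not validated.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical compact timestamp: whole seconds since 1970-01-01 00:00:00
// (the log time is assumed to be UTC) plus the microsecond remainder.
struct CompactStamp {
    long long seconds;
    int micros;
};

// Reads exactly n decimal digits starting at p; returns -1 on a non-digit.
int fixed_digits(char const* p, int n) {
    int v = 0;
    for (int i = 0; i < n; ++i) {
        if (p[i] < '0' || p[i] > '9') return -1;
        v = v * 10 + (p[i] - '0');
    }
    return v;
}

// Maps a three-letter English month abbreviation to 1..12, or 0 if unknown.
int month_number(char const* p) {
    static char const names[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
    for (int m = 0; m < 12; ++m)
        if (std::memcmp(p, names + 3 * m, 3) == 0) return m + 1;
    return 0;
}

// Days between 1970-01-01 and the given Gregorian date (year >= 1970 assumed).
long long days_from_civil(int y, int m, int d) {
    if (m <= 2) --y;                 // treat January and February as part of the previous year
    int const era = y / 400;
    int const yoe = y - era * 400;   // year within the 400-year era: [0, 399]
    int const doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    int const doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return static_cast<long long>(era) * 146097 + doe - 719468;
}

// Parses "YYYY-Mon-DD hh:mm:ss.ffffff" (27 characters) from [b, e).
bool parse_stamp(char const* b, char const* e, CompactStamp& out) {
    assert(b <= e);
    if (e - b < 27) return false;
    if (b[4] != '-' || b[8] != '-' || b[11] != ' ' ||
        b[14] != ':' || b[17] != ':' || b[20] != '.') return false;

    int const year = fixed_digits(b, 4), month = month_number(b + 5);
    int const day = fixed_digits(b + 9, 2), hour = fixed_digits(b + 12, 2);
    int const minute = fixed_digits(b + 15, 2), sec = fixed_digits(b + 18, 2);
    int const us = fixed_digits(b + 21, 6);
    if (year < 1970 || month == 0 || day < 1 || hour < 0 || minute < 0 ||
        sec < 0 || us < 0) return false;

    out.seconds = days_from_civil(year, month, day) * 86400
                + hour * 3600 + minute * 60 + sec;
    out.micros = us;
    return true;
}
```

For the first sample line, parse_stamp would yield 1520946839 seconds and 580482 microseconds. Whether this is actually faster than the stream-based parser is something to measure rather than assume.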

Revision 2: Compressed Files

You can use IOStreams to transparently decompress input as well:

int main(int argc, char **argv) {
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 
}

parse_logfile then can be implemented as:

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into) {
    static QiParsers::LogParser<It> const parser;

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into);
}

Note: The library has zlib, gzip and bzip2 decompressors. I opted for gzip for demonstration.

Prints Live On Coliru

Parsing input.gz - success (9 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))

Revision 3: String Interning

"Interned" strings, or "Atoms", are a common way to reduce string allocations. You can use Boost Flyweight, but in my experience it is a bit complicated to get right. So, why not create your own abstraction:

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

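    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     *
     * NOTE: std::deque storage is not guaranteed to be contiguous, so an Atom
     * built from &*match is only valid while the matched characters happen to
     * lie within a single internal block of the deque.
     */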
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

Now, we need to integrate that into the parser. I'd suggest using a semantic action.

Note: Traits could still help here, but they're static, and that would require the StringTable to be global, which is a choice I'd never make... unless absolutely obligated.

First, changing the Ast:

struct LogRecord {
    int driver;
    double time;
    double vel;
    double km;
    Timestamp date;
    Atom road;       // << HERE using Atom now
    Event event;
};

Next, let's create a rule that creates such an atom:

qi::rule<It, MyEvents::Atom()> atom;

atom = raw[+graph][_val = intern_(_1)];

Of course, that raises the question of how the semantic action is implemented:

struct intern_f {
    StringTable& _table;

    typedef StringTable::Atom result_type;
    explicit intern_f(StringTable& table) : _table(table) {}

    StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
        return _table.intern(sequential(range));
    }

  private:
    // be more efficient if It is const char*
    static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
        return boost::string_ref(range.begin(), range.size());
    }
    template <typename OtherIt>
    static std::string sequential(boost::iterator_range<OtherIt> const& range) {
        return std::string(range.begin(), range.end());
    }
};
boost::phoenix::function<intern_f> intern_;

The grammar's constructor hooks up the intern_ functor to the StringTable& passed in.

Full Demo

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
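#include <boost/fusion/adapted/struct.hpp>
#include <boost/fusion/include/io.hpp> // operator<< for adapted structs
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/date_time/posix_time/posix_time_io.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/container/flat_set.hpp>
#include <algorithm> // std::search, std::copy
#include <deque>
#include <fstream>
#include <iostream>
#include <iterator>  // std::ostreambuf_iterator
#include <string>
#include <vector>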

struct StringTable {
    typedef boost::string_ref Atom;
    typedef boost::container::flat_set<Atom> Index;
    typedef std::deque<char> Store;

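    /* An insert in the middle of the deque invalidates all the iterators and
     * references to elements of the deque. An insert at either end of the
     * deque invalidates all the iterators to the deque, but has no effect on
     * the validity of references to elements of the deque.
     *
     * NOTE: std::deque storage is not guaranteed to be contiguous, so an Atom
     * built from &*match is only valid while the matched characters happen to
     * lie within a single internal block of the deque.
     */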
    Store backing;
    Index index;

    Atom intern(boost::string_ref const& key) {
        Index::const_iterator it = index.find(key);

        if (it == index.end()) {
            Store::const_iterator match = std::search(
                    backing.begin(), backing.end(),
                    key.begin(), key.end());

            if (match == backing.end()) {
                size_t offset = backing.size();
                backing.insert(backing.end(), key.begin(), key.end());
                match = backing.begin() + offset;
            }

            it = index.insert(Atom(&*match, key.size())).first;
        }
        // return the Atom from backing store
        return *it;
    }
};

namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    typedef boost::posix_time::ptime Timestamp;
    typedef StringTable::Atom Atom;

    struct LogRecord {
        int driver;
        double time;
        double vel;
        double km;
        Timestamp date;
        Atom road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}

BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))

BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (MyEvents::Timestamp, date)
        (double, time)
        (int, driver)
        (double, vel)
        (MyEvents::Atom, road)
        (double, km)
        (MyEvents::Event, event))

namespace qi = boost::spirit::qi;

namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser(StringTable& strings) : LogParser::base_type(start), intern_(intern_f(strings)) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            atom = raw[+graph][_val = intern_(_1)];

            line_record
                = '[' >> stream >> ']'
                >> " - " >> double_ >> " s"
                >> " => Driver: "  >> int_
                >> " - Speed: "    >> double_
                >> " - Road: "     >> atom
                >> " - Km: "       >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            BOOST_SPIRIT_DEBUG_NODES((start)(line_record)(atom))
        }

      private:
        struct intern_f {
            StringTable& _table;

            typedef StringTable::Atom result_type;
            explicit intern_f(StringTable& table) : _table(table) {}

            StringTable::Atom operator()(boost::iterator_range<It> const& range) const {
                return _table.intern(sequential(range));
            }

          private:
            // be more efficient if It is const char*
            static boost::string_ref sequential(boost::iterator_range<const char*> const& range) {
                return boost::string_ref(range.begin(), range.size());
            }
            template <typename OtherIt>
            static std::string sequential(boost::iterator_range<OtherIt> const& range) {
                return std::string(range.begin(), range.end());
            }
        };
        boost::phoenix::function<intern_f> intern_;

        qi::rule<It, MyEvents::LogRecords()> start;

        qi::rule<It, MyEvents::LogRecord()> line_record;
        qi::rule<It, MyEvents::Atom()> atom;

        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}

template <typename It>
bool parse_spirit(It b, It e, MyEvents::LogRecords& into, StringTable& strings) {
    QiParsers::LogParser<It> parser(strings); // TODO optimize by not reconstructing all parser rules each time

    return parse(b, e, parser, into);
}

bool parse_logfile(char const* fname, MyEvents::LogRecords& into, StringTable& strings) {
    boost::iostreams::filtering_istream is;
    is.push(boost::iostreams::gzip_decompressor());

    std::ifstream ifs(fname, std::ios::binary);
    is.push(ifs);

    boost::spirit::istream_iterator f(is >> std::noskipws), l;
    return parse_spirit(f, l, into, strings);
}

namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}

int main(int argc, char **argv) {
    StringTable strings;
    MyEvents::LogRecords records;

    for (char** arg = argv+1; *arg && (argv+argc != arg); ++arg) {
        bool ok = parse_logfile(*arg, records, strings);

        std::cout 
            << "Parsing " << *arg << (ok?" - success" : " - errors")
            << " (" << records.size() << " records total)\n";
    }

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n"; 

    std::cout << "Interned strings: " << strings.index.size() << "\n";
    std::cout << "Table backing: '";
    std::copy(strings.backing.begin(), strings.backing.end(), std::ostreambuf_iterator<char>(std::cout));
    std::cout << "'\n";
    for (StringTable::Index::const_iterator it = strings.index.begin(); it != strings.index.end(); ++it) {
        std::cout << " entry - " << *it << "\n";
    }
}

When running with 2 input files, the second one slightly changed from the first:

zcat input.gz | sed 's/[16] - Km/ - Km/' | gzip > second.gz

It prints

Parsing input.gz - success (9 records total)
Parsing second.gz - success (18 records total)
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-11 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-11 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-11 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-11 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-16 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-16 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-16 90.4 (GEAR -1))
(2018-Mar-13 13:13:59.580482 0.2 0 0 A-1 90 (SLOPE 5.5))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (GEAR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0 A-1 90 (DIR 1))
(2018-Mar-13 13:14:01.170203 1.79 0 0.1 A-1 90 (LOCATION 0))
(2018-Mar-13 13:14:01.170203 1.98 0 0 A-1 90.1 (GEAR 2))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (LOCATION 0))
(2018-Mar-13 13:14:01.819966 2.44 0 0.1 B-1 90.2 (DIR 2))
(2018-Mar-13 13:15:01.819966 3.44 0 0.2 B-1 90.3 (SLOPE -10))
(2018-Mar-13 13:14:01.170203 1.98 0 0 B-1 90.4 (GEAR -1))

The interesting thing is in the interned string stats:

Interned strings: 4
Table backing: 'A-11B-16'
 entry - A-1
 entry - A-11
 entry - B-1
 entry - B-16

Note how B-1 and A-1 got deduplicated as substrings of A-11 and B-16, because those longer strings had already been interned. Preseeding your string table might be beneficial to achieve the best re-use.
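
One caveat about the StringTable listing: std::deque does not guarantee contiguous storage, so an Atom built from &*match could straddle two internal blocks once the backing store grows. A simpler variant that sidesteps this, at the price of losing the substring sharing shown above, keeps each distinct string in a node-based std::set, whose elements never move after insertion. This is a different technique from the one in the answer, and SimpleInterner is a made-up name; treat it as a sketch to benchmark, not as a drop-in replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical alternative to StringTable: every distinct string is stored
// exactly once in a std::set. Set elements are never relocated, so a pointer
// to an element stays valid for the lifetime of the interner.
class SimpleInterner {
  public:
    typedef std::string const* Atom; // equal pointers <=> equal strings

    // Interns the characters in [b, e) and returns the shared instance.
    Atom intern(char const* b, char const* e) {
        assert(b <= e);
        // insert() leaves the set unchanged and returns the existing
        // element when an equal string is already present.
        return &*pool_.insert(std::string(b, e)).first;
    }

    std::size_t size() const { return pool_.size(); }

  private:
    std::set<std::string> pool_;
};
```

Comparing two atoms is then a pointer comparison. Note that each lookup builds a temporary std::string; for short road names this usually stays within the small-string buffer, but that depends on the standard library in use.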

Various Remarks

I don't have a lot of tips for reducing compilation times. I'd simply put all Spirit stuff in a separate TU and accept the compilation time on that one. After all, it's about trading compilation time for runtime performance.
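
To make the separate-TU idea concrete, here is a minimal sketch of the split, shown as one listing. The file names and parse_driver_ids are hypothetical, and the function body is a hand-written stand-in so the sketch compiles without Boost; in the real project the body would instantiate the Spirit grammar instead.

```cpp
// ---- log_parser.hpp (hypothetical name): cheap to include, no Spirit headers ----
#include <vector>

// Appends the driver id of every record found in [b, e);
// returns false if "Driver: " is not followed by a digit.
bool parse_driver_ids(char const* b, char const* e, std::vector<int>& out);

// ---- log_parser.cpp (hypothetical name): the only file that pays for the grammar ----
// This is where <boost/spirit/include/qi.hpp> and the LogParser grammar would
// be included; client code never sees those templates.
#include <algorithm>
#include <cassert>
#include <cstring>

bool parse_driver_ids(char const* b, char const* e, std::vector<int>& out) {
    assert(b <= e);
    static char const key[] = "Driver: ";
    char const* const key_end = key + std::strlen(key);

    for (char const* p = std::search(b, e, key, key_end); p != e;
         p = std::search(p, e, key, key_end)) {
        p += key_end - key;                       // skip past "Driver: "
        if (p == e || *p < '0' || *p > '9') return false;
        int id = 0;
        for (; p != e && *p >= '0' && *p <= '9'; ++p)
            id = id * 10 + (*p - '0');
        out.push_back(id);
    }
    return true;
}
```

With this layout, editing client code only recompiles files that include the small header, and the expensive template instantiation happens once, in the parser's own translation unit.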

Regarding the string interning, you might be better served with a flat_set<char const*> so that you only construct an atom with a particular length on demand.

If all strings are small, you might be (far) better off using just a small-string optimization.
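
As an illustration of that last point, a fixed-capacity inline string avoids the heap entirely. The sketch below assumes road names never exceed 15 characters, which is a guess based on the sample logs; SmallName is a hypothetical type, not something from Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-capacity string: up to 15 characters stored inline,
// always NUL-terminated, no heap allocation.
struct SmallName {
    enum { capacity = 15 };
    char data[capacity + 1];

    // Copies [b, e) if it fits. On overflow the name is left empty and
    // false is returned, so the caller can fall back to another strategy.
    bool assign(char const* b, char const* e) {
        assert(b <= e);
        std::size_t const n = static_cast<std::size_t>(e - b);
        data[0] = '\0';
        if (n > capacity) return false;
        std::memcpy(data, b, n);
        data[n] = '\0';
        return true;
    }

    bool operator==(SmallName const& other) const {
        return std::strcmp(data, other.data) == 0;
    }
};
```

Each record then carries 16 bytes for the road name, with no indirection and no shared table to maintain. Whether that beats interning depends on how long and how repetitive the real names are.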

I'll let you do the comparative benchmarking; you might want to keep using your own unzipping + const char* iterators. This was mostly to show that Boost has it, and that you do not need to "read the whole file at once".

In fact, on that subject, you might want to store the results in a memory mapped file, so you can keep working even if you exceed physical memory limitations.

Multi Index & Queries

You can find concrete examples about this in my previous answer: BONUS: Multi-Index

Note especially the way to get the index by reference:

Indexing::Table idx(events.begin(), events.end());

This can also be used to store a result-set in another (index) container for repeated/further processing.
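
For the "one vector or two" question, a lightweight middle ground is to keep a single vector of records and record, per event kind, the positions of the matching entries while parsing. This is a plain-vector stand-in, not the Boost.MultiIndex approach from the linked answer, and RecordStore and the reduced Record struct are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Kind { LOCATION, SLOPE, GEAR, DIR, KIND_COUNT };

// Reduced record for illustration; the real LogRecord has more fields.
struct Record {
    double time;
    Kind kind;
    double value;
};

// Hypothetical store: all records live in one vector (cheap to append while
// parsing), and each kind has a list of positions into that vector.
class RecordStore {
  public:
    void push_back(Record const& r) {
        assert(r.kind >= 0 && r.kind < KIND_COUNT);
        by_kind_[r.kind].push_back(records_.size()); // position of the new record
        records_.push_back(r);
    }

    // Positions of all records of the given kind, in parse order.
    std::vector<std::size_t> const& positions(Kind k) const { return by_kind_[k]; }

    Record const& at(std::size_t i) const { return records_[i]; }
    std::size_t size() const { return records_.size(); }

  private:
    std::vector<Record> records_;
    std::vector<std::size_t> by_kind_[KIND_COUNT];
};
```

A query such as "all GEAR events" then walks positions(GEAR) only, while interval queries over every record still scan the single vector. The extra cost per parsed record is one push_back of an index.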

Awesome explanation and full exposition of all you are able to achieve using the right boost library. I wonder if you have taken part in development or maintainance of boost::spirit. Again, thanks a lot for your answer (and also thank you for the free english lesson). — Pablo, Mar 31 '18 at 23:31
Maybe I am imposing upon, so If you prefer I can open a new question for the next two doubts. 1) date_time is one of the few boost libs that requires precompiling. It is not a big deal anyway, but I wonder if it would be easy to use other lib or just old time_t, even knowing its limitations and that it is not posix. 2) If I try to add the "Event" struct members into "LogRecord" struct (just leaving one only struct) I find that I can not use "attr(MyEvents::LOCATION) >> attr(0.0)" for giving values by default (only matches LOCATION, but it does not match 0.0). Any idea for solving that? — Pablo, Apr 01 '18 at 09:28
You can "fake it": https://gist.github.com/sehe/212ce5e3086eb3b26a6e6f806002f967/revisions using c++11 [get_time](http://en.cppreference.com/w/cpp/io/manip/get_time). Note that `Timestamp` is now twice as big. For alternatives, see e.g. https://stackoverflow.com/questions/37856887/how-do-i-parse-a-date-time-string-that-includes-fractional-time — sehe, Apr 01 '18 at 14:10
Now also added a revision that flattened `Event` into `LogRecord`, also **[Live On Coliru](http://coliru.stacked-crooked.com/a/5967e19b78402295)**. If you have further issues with that, I think it's time for a new (targeted) question — sehe, Apr 01 '18 at 14:14
I don't think that a new question is necessary. Your provided source code is clear and self explanatory. Thank you for all your help and attention. — Pablo, Apr 01 '18 at 23:48