After several questions about parsing fairly complex logs, I have finally been shown the best method for doing so.
The question now is whether there is some way to improve performance and/or reduce memory usage, and even compilation times. I would ask that answers fulfill these constraints:
MS VS 2010 (not fully C++11, only some features are implemented: auto, lambdas...) and Boost 1.53 (the only issue here is that string_view was not yet available, but string_ref can still be used, bearing in mind that it will probably be deprecated in the future).
The logs are zipped, and they are unzipped directly into RAM by an open-source library that outputs a plain C char array, so it is not worthwhile to use a std::string because the memory is already allocated by the library. There are thousands of them and they fill several GB, so keeping them all in memory is not an option. This means that keeping string_view (or string_ref) references into the buffers is not possible, because the logs are deleted after they are parsed.
It might be a good idea to parse the date string as a POSIX time. Only two remarks: it would be good to avoid allocating a string for this, and as far as I know POSIX time has no sub-second resolution, so the fractional seconds would have to be kept in a separate variable.
Some strings are repeated throughout the logs (the road variable, for example). It might be interesting to use a flyweight pattern (for instance its Boost implementation) to reduce memory, keeping in mind that this will have a performance cost.
Compilation times are a pain when working with template libraries. I would really appreciate any tweak that helps to reduce them. Maybe splitting the grammar into sub-grammars? Maybe using precompiled headers?
The final use of this is to run queries about any event, e.g. get all the GEAR events (values and times), and also to have records of all the car variables within a fixed interval or every time an event occurs. There are two types of records in the logs: pure "Location" records and "Location + Event" records (that is, every time an event is parsed the location must also be saved). Separating them into two vectors allows fast queries but slows down parsing. Using a single common vector allows fast parsing but slows down queries. Any ideas about this? Maybe a Boost.MultiIndex container would help, as was suggested before?
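The trade-off between one vector and two might be softened by keeping a single vector plus a small index of positions per event kind. The following is only a sketch of that idea using the standard library, not Boost.MultiIndex; `Record`, `RecordStore` and `values_between` are hypothetical names, and `Record` is a simplified stand-in for the real record type.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the record types in this question.
enum Kind { LOCATION, SLOPE, GEAR, DIR, KIND_COUNT };

struct Record {
    double time;   // seconds since the start of the log
    double km;     // position along the road
    Kind   kind;   // LOCATION when the line carries no event
    double value;  // event value, unused for LOCATION
};

// One vector holds every record in parse order; by_kind[k] holds the
// positions (indices into records) of the records whose kind is k.
struct RecordStore {
    std::vector<Record>      records;
    std::vector<std::size_t> by_kind[KIND_COUNT];

    void add(Record const& r) {
        by_kind[r.kind].push_back(records.size());
        records.push_back(r);
    }
};

// Values of all events of kind k with t0 <= time <= t1, in parse order.
std::vector<double> values_between(RecordStore const& s, Kind k,
                                   double t0, double t1) {
    std::vector<double> out;
    std::vector<std::size_t> const& idx = s.by_kind[k];
    for (std::size_t i = 0; i < idx.size(); ++i) {
        Record const& r = s.records[idx[i]];
        if (r.time >= t0 && r.time <= t1)
            out.push_back(r.value);
    }
    return out;
}
```

Parsing still appends to one vector, and a query for one kind only touches the records of that kind. The price is one extra `std::size_t` per record for the index.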
Please do not hesitate to provide any solution or to change anything that you think may help to achieve the goal.
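As a rough illustration of the flyweight idea for repeated strings such as the road name, here is a hand-rolled interning pool. It is not Boost.Flyweight, only a standard-library sketch of the same principle: each distinct string is stored once and records keep a small integer id. `StringPool` and its members are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: stores each distinct string once and hands out ids.
class StringPool {
public:
    typedef unsigned Id;

    StringPool() {}

    // Returns the id for the characters in [b, e), adding them on first sight.
    // Note: this sketch builds a temporary std::string for the lookup.
    Id intern(char const* b, char const* e) {
        std::string key(b, e);
        std::map<std::string, Id>::iterator it = ids_.find(key);
        if (it != ids_.end())
            return it->second;
        Id id = static_cast<Id>(names_.size());
        it = ids_.insert(std::make_pair(key, id)).first;
        names_.push_back(&it->first); // map nodes stay at a fixed address
        return id;
    }

    // The string that belongs to an id previously returned by intern().
    std::string const& name(Id id) const { return *names_[id]; }

    // Number of distinct strings seen so far.
    std::size_t size() const { return names_.size(); }

private:
    // Copying would leave names_ pointing into the other pool's map.
    StringPool(StringPool const&);
    StringPool& operator=(StringPool const&);

    std::map<std::string, Id>       ids_;
    std::vector<std::string const*> names_;
};
```

A record would then store a `StringPool::Id` instead of a `std::string` for the road. The lookup on every line is the performance cost mentioned above.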
//#define BOOST_SPIRIT_DEBUG
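#include <boost/fusion/adapted/struct.hpp>
#include <boost/fusion/include/io.hpp> // operator<< for adapted structs
#include <boost/spirit/include/qi.hpp>
#include <cstring>  // strlen
#include <iostream> // std::cout
#include <string>
#include <vector>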
typedef char const* It;
namespace MyEvents {
    enum Kind { LOCATION, SLOPE, GEAR, DIR };

    struct Event {
        Kind kind;
        double value;
    };

    struct LogRecord {
        int driver;
        double time;
        double vel;
        double km;
        std::string date;
        std::string road;
        Event event;
    };

    typedef std::vector<LogRecord> LogRecords;
}
BOOST_FUSION_ADAPT_STRUCT(MyEvents::Event,
        (MyEvents::Kind, kind)
        (double, value))

BOOST_FUSION_ADAPT_STRUCT(MyEvents::LogRecord,
        (std::string, date)
        (double, time)
        (int, driver)
        (double, vel)
        (std::string, road)
        (double, km)
        (MyEvents::Event, event))
namespace qi = boost::spirit::qi;
namespace QiParsers {
    template <typename It>
    struct LogParser : qi::grammar<It, MyEvents::LogRecords()> {

        LogParser() : LogParser::base_type(start) {
            using namespace qi;

            kind.add
                ("SLOPE", MyEvents::SLOPE)
                ("GEAR", MyEvents::GEAR)
                ("DIR", MyEvents::DIR);

            values.add("G1", 1.0)
                      ("G2", 2.0)
                      ("REVERSE", -1.0)
                      ("NORTH", 1.0)
                      ("EAST", 2.0)
                      ("WEST", 3.0)
                      ("SOUTH", 4.0);

            MyEvents::Event null_event = {MyEvents::LOCATION, 0.0};

            line_record
                = '[' >> raw[repeat(4)[digit] >> '-' >> repeat(3)[alpha] >> '-' >> repeat(2)[digit] >> ' ' >>
                             repeat(2)[digit] >> ':' >> repeat(2)[digit] >> ':' >> repeat(2)[digit] >> '.' >> repeat(6)[digit]] >> "]"
                >> " - " >> double_ >> " s"
                >> " => Driver: " >> int_
                >> " - Speed: " >> double_
                >> " - Road: " >> raw[+graph]
                >> " - Km: " >> double_
                >> (" - " >> kind >> ": " >> (double_ | values) | attr(null_event));

            start = line_record % eol;

            //BOOST_SPIRIT_DEBUG_NODES((start)(line_record))
        }

      private:
        qi::rule<It, MyEvents::LogRecords()> start;
        qi::rule<It, MyEvents::LogRecord()> line_record;
        qi::symbols<char, MyEvents::Kind> kind;
        qi::symbols<char, double> values;
    };
}
MyEvents::LogRecords parse_spirit(It b, It e) {
    static QiParsers::LogParser<It> const parser;

    MyEvents::LogRecords records;
    parse(b, e, parser, records);

    return records;
}
static char input[] =
"[2018-Mar-13 13:13:59.580482] - 0.200 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - SLOPE: 5.5\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - GEAR: G1\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.0 - DIR: NORTH\n\
[2018-Mar-13 13:14:01.170203] - 1.790 s => Driver: 0 - Speed: 0.1 - Road: A-11 - Km: 90.0\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: A-11 - Km: 90.1 - GEAR: G2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2\n\
[2018-Mar-13 13:14:01.819966] - 2.440 s => Driver: 0 - Speed: 0.1 - Road: B-16 - Km: 90.2 - DIR: EAST\n\
[2018-Mar-13 13:15:01.819966] - 3.440 s => Driver: 0 - Speed: 0.2 - Road: B-16 - Km: 90.3 - SLOPE: -10\n\
[2018-Mar-13 13:14:01.170203] - 1.980 s => Driver: 0 - Speed: 0.0 - Road: B-16 - Km: 90.4 - GEAR: REVERSE\n";
static const size_t len = strlen(input);
namespace MyEvents { // for debug/demo
    using boost::fusion::operator<<;

    static inline std::ostream& operator<<(std::ostream& os, Kind k) {
        switch(k) {
            case LOCATION: return os << "LOCATION";
            case SLOPE:    return os << "SLOPE";
            case GEAR:     return os << "GEAR";
            case DIR:      return os << "DIR";
        }
        return os;
    }
}
int main() {
    MyEvents::LogRecords records = parse_spirit(input, input+len);

    std::cout << "Parsed: " << records.size() << " records\n";

    for (MyEvents::LogRecords::const_iterator it = records.begin(); it != records.end(); ++it)
        std::cout << *it << "\n";

    return 0;
}
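The date that the grammar above captures into a std::string could instead be converted on the fly, without allocating anything. The following is a minimal sketch of that conversion written by hand with the standard library only, assuming the timestamps are treated as UTC and always have the exact layout of the sample lines (whose fractional part has six digits, i.e. microseconds). `Timestamp`, `parse_timestamp` and the helper functions are hypothetical names, and field ranges (day 1..31, hour 0..23, ...) are not validated.

```cpp
#include <cstring>

// Hypothetical result type: whole seconds since 1970-01-01 00:00:00 (the
// log time is treated as UTC) plus the six fractional digits.
struct Timestamp {
    long long seconds;
    int       micros;
};

// Days between 1970-01-01 and the given proleptic Gregorian date.
static long long days_from_civil(int y, int m, int d) {
    y -= (m <= 2) ? 1 : 0;                         // year starts in March
    long long era = (y >= 0 ? y : y - 399) / 400;
    long long yoe = y - era * 400;                 // year of era, [0, 399]
    long long mp  = (m > 2) ? m - 3 : m + 9;       // March = 0 ... February = 11
    long long doy = (153 * mp + 2) / 5 + d - 1;    // day of the shifted year
    long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

// Reads exactly n decimal digits; advances p only on success.
static bool read_digits(char const*& p, char const* e, int n, int& out) {
    if (e - p < n) return false;
    int v = 0;
    for (int i = 0; i < n; ++i) {
        char c = p[i];
        if (c < '0' || c > '9') return false;
        v = v * 10 + (c - '0');
    }
    p += n;
    out = v;
    return true;
}

// Consumes one expected character.
static bool expect(char const*& p, char const* e, char c) {
    if (p == e || *p != c) return false;
    ++p;
    return true;
}

// Maps "Jan".."Dec" to 1..12; returns 0 for anything else.
static int month_from_abbrev(char const* p) {
    static char const names[] = "JanFebMarAprMayJunJulAugSepOctNovDec";
    for (int m = 0; m < 12; ++m)
        if (std::memcmp(p, names + 3 * m, 3) == 0) return m + 1;
    return 0;
}

// Parses "YYYY-Mon-DD hh:mm:ss.ffffff" from [b, e) without allocating.
bool parse_timestamp(char const* b, char const* e, Timestamp& out) {
    char const* p = b;
    int y, d, hh, mm, ss, us;
    if (!read_digits(p, e, 4, y) || !expect(p, e, '-')) return false;
    if (e - p < 3) return false;
    int mon = month_from_abbrev(p);
    if (mon == 0) return false;
    p += 3;
    if (!expect(p, e, '-') || !read_digits(p, e, 2, d)  || !expect(p, e, ' ') ||
        !read_digits(p, e, 2, hh) || !expect(p, e, ':') ||
        !read_digits(p, e, 2, mm) || !expect(p, e, ':') ||
        !read_digits(p, e, 2, ss) || !expect(p, e, '.') ||
        !read_digits(p, e, 6, us) || p != e)
        return false;
    out.seconds = days_from_civil(y, mon, d) * 86400LL + hh * 3600 + mm * 60 + ss;
    out.micros  = us;
    return true;
}
```

For the first sample line, "2018-Mar-13 13:13:59.580482", this gives 1520946839 seconds and 580482 microseconds. Such a function could be called on the iterator range that raw[] exposes, so that LogRecord stores two integers instead of a std::string.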