Boost Spirit Qi: is it a suitable tool to analyse and split a "multiline" data file?

Question

I want to apply various operations to data files: set algebra, statistics, reporting, and modifications. The file format is far from the usual code examples and a bit unusual: there are different sorts of items and item types, and some of them are grouped into collections. A simplified example is shown below.
I'm new to boost::spirit and have tried writing code to split the items and extract the basic information (name, version, date) that most of the processing requires. It has turned out to be tricky. Is the problem my lack of skill, or is boost::spirit not suited to this format?
Studying boost::spirit is not a waste of time, since I am sure I will use it later. But I couldn't find code examples that resemble my case, so I may be going the wrong way.

>>>process_type_A
//name(typeA_1)
//version(A.1.99)
//date(2016.01.01)
//property1 "pA11"
//property2 "pA12"
//etc_A_1 (thousands of lines - many are "multiline" and/or multiline sub-records)
<<<process_type_A
>>>process_type_A
//name(typeA_2)
//version(A.2.99)
//date(2016.01.02)
//property1 "pA21"
//property2 "pA22"
//etc_A_2 (hundred or thousand of lines)
<<<process_type_A
>>>process_type_B
//name(typeB_1)
//version(B.1.99)
//date(2016.02.01)
//property1 "pB11"
//property2 "pB12"
//etc_B_1 (hundred or thousand of lines)
<<<process_type_B
>>>paramset_type_C
//>>paramlist
////name(typeC_1)
////version(C.1.99)
////date(2016.03.01)
////property1 "pC11"
////property2 "pC12"
////etc_C_1 (hundred or thousand of lines)
//<<paramlist
//>>paramlist
////name(typeC_2)
////version(C.2.99)
////date(2016.04.01)
////property1 "pC21"
////property2 "pC22"
////etc_C_2 (hundred or thousand of lines)
//<<paramlist
<<<paramset_type_C

Code::Blocks
Boost 1.60.0
GCC Compiler on Windows and Linux

I think regex with captures is an expressive enough tool for this grammar. – Tomilov Anatoliy, Jan 22 '16 at 17:38
I often use regex in the shell, but I couldn't imagine using it here because of code readability and performance. I'm looking into the boost::regex library. – Steven M., Jan 23 '16 at 17:01
Isn't `std::regex` suitable (that is what I meant above, not the shell)? I don't think `boost::regex` is much better than it. – Tomilov Anatoliy, Jan 23 '16 at 18:03

score 2 · Accepted Answer · edited May 23 '17 at 12:15

I think @Orient is right: regex w/captures is enough here.
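
A minimal sketch of the regex-with-captures idea using `std::regex`, assuming the header lines always look like `name(...)`, `version(...)` and `date(...)` with any number of leading slashes, as in the sample in the question. `Header` and `extract_headers` are illustrative names, not something from the question, and a new record is assumed to start at each `name(...)` line.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: only the three fields the question asks for,
// plus the enclosing top-level block (e.g. "process_type_A").
struct Header {
    std::string block;
    std::string name;
    std::string version;
    std::string date;
};

std::vector<Header> extract_headers(const std::string& text) {
    // ">>>some_type" opens a top-level block.
    static const std::regex open_re(R"(^>>>(\w+)\s*$)");
    // Any number of leading slashes, then key(value).
    static const std::regex field_re(R"(^/*(name|version|date)\(([^)]*)\)\s*$)");

    std::vector<Header> out;
    std::string block;
    std::istringstream in(text);
    std::string line;
    std::smatch m;
    while (std::getline(in, line)) {
        if (std::regex_match(line, m, open_re)) {
            block = m.str(1);
        } else if (std::regex_match(line, m, field_re)) {
            const std::string key = m.str(1);
            const std::string value = m.str(2);
            if (key == "name") {
                // Assumption: every record starts with its name line.
                Header h;
                h.block = block;
                h.name = value;
                out.push_back(h);
            } else if (!out.empty()) {
                if (key == "version") out.back().version = value;
                else                  out.back().date = value;
            }
        }
    }
    return out;
}
```

All other lines (properties, `<<<` closers, the long bodies) are simply ignored, so the cost is one or two regex matches per line.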

However, Spirit has the upside of being header-only, so it comes without a linker dependency. Here are some approaches (using seek[] and raw[]) for inspiration:

Boost spirit revert parsing
rule to extract key+phrases from a text document
Parsing text file with binary envelope using boost Spririt (binary content)
much more involved logic: How to implement #ifdef in a boost::spirit::qi grammar?

Note that Spirit X3 (still experimental) also has a seek[] directive and it will compile much faster.

answered Jan 22 '16 at 17:48

sehe

@Orient's advice and yours sound wise. I will study those links. I didn't find them when I searched Stack Overflow; the usual problem of choosing the right keywords. – Steven M., Jan 23 '16 at 17:09

score 1 · Answer 2 · answered Jan 22 '16 at 17:37

The main advice I would give about Qi is that it is a very powerful and flexible tool for parsing. You can define quite complicated, possibly recursive structures, using boost::variant, boost::optional, etc., and associate these types with qi rules and it seemingly magically does the right thing, giving you a nice AST for your data.

The biggest sources of difficulty in my (limited) experience are when you try to make it do more than that and also process the data. It's sometimes tempting to try to "eagerly" do some processing at the same time that you are parsing the data, often in a semantic action or something. Don't do it! It usually makes things harder to read in the end, a bit harder to debug, and sometimes you can be surprised what will happen if the grammar has to backtrack across your semantic action which it already executed.

Qi should work well if you can write a clean grammar for your data. If you can't write an unambiguous grammar, you might be able to use qi::eps to make it parseable, but you don't want to have to do that too often IMO. I don't think "hundreds or thousands" of items will pose any particular problem.

Right now the question is rather opinion-oriented -- if you can post a more complete description of the data format you have, or better, a complete code example which is failing, it might make it easier to give precise answers.

"Eagerly": you're right, it's a bad habit for coders.
Anyway, Spirit will be useful to check/parse/scan some data subsets from files of that type. – Steven M., Jan 23 '16 at 17:19
