What is good practice for parsing through long strings in C++?

Question

I have to parse through a long string and assign the parts of the string to different variables. I did this in a very roundabout way, which works just fine, but doesn't read as well as I would like. Is there a more efficient way to loop through this?

What I'm doing is starting at the first index of each studentData string, stopping at each comma, and storing what lies between the commas until I reach the end of the string.

int rhs = studentData.find(","); 
string studentID = studentData.substr(0, rhs);

int lhs = rhs + 1;
rhs = studentData.find(",", lhs);
string firstName = studentData.substr(lhs, rhs - lhs);

lhs = rhs + 1;
rhs = studentData.find(",", lhs);
string lastName = studentData.substr(lhs, rhs - lhs);

lhs = rhs + 1;
rhs = studentData.find(",", lhs);
string eMail = studentData.substr(lhs, rhs - lhs);

lhs = rhs + 1;
rhs = studentData.find(",", lhs);
int age = stoi(studentData.substr(lhs, rhs - lhs));

lhs = rhs + 1;
rhs = studentData.find(",", lhs);
int daysInCourse1 = stoi(studentData.substr(lhs, rhs - lhs));

lhs = rhs + 1;
rhs = studentData.find(",", lhs);
int daysInCourse2 = stoi(studentData.substr(lhs, rhs - lhs));

lhs = rhs + 1;
rhs = studentData.find(",", lhs);
int daysInCourse3 = stoi(studentData.substr(lhs, rhs - lhs));

lhs = rhs + 1;
rhs = studentData.find(",", lhs);
to_string(degreeProgram) = studentData.substr(lhs, rhs - lhs);

Examples of the strings to parse:

    "A1,John,Smith,John1989@gm ail.com,20,30,35,40,SECURITY",
    "A2,Suzan,Erickson,Erickson_1990@gmailcom,19,50,30,40,NETWORK",

I appreciate any feedback or forwarding to different sources that may provide better insight.

Is there a specific schema to this long string? In that case you could take a look at [std::regex](https://en.cppreference.com/w/cpp/regex) — Zoso, Mar 29 '21 at 05:54
You should include a couple of the actual strings that you want to parse in the question too. Just [edit](https://stackoverflow.com/posts/66849181/edit) the question and put them in a code block. — Ted Lyngmo, Mar 29 '21 at 06:16
In retrospect, I understand that should have been obvious. Sorry. — NotPretendingToBeSomebodyElse, Mar 29 '21 at 06:26
[How can I read and parse CSV files in C++?](https://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c) will likely help. — Ted Lyngmo, Mar 29 '21 at 06:33
Thanks, Ted. I'm very new to coding, so I didn't realize there was an abbreviation for that. I appreciate the help! — NotPretendingToBeSomebodyElse, Mar 29 '21 at 06:41
Unfortunately c++ STL doesn't have a quick and convenient way of splitting string on delimiter in one line of code, as so many other languages do. So you have to write a function or import a dependency. The cleanest is possibly this: https://stackoverflow.com/a/64886763/7098259 — Patrick Parker, Mar 29 '21 at 07:05
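The kind of helper function this comment refers to could be sketched as follows. This is a minimal illustration, not a library API: the name `split_fields` is made up, and it assumes fields never contain the delimiter (no quoted CSV fields). It builds on std::getline with a custom delimiter.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split `line` on `delim`, keeping empty fields ("a,,b" -> "a", "", "b").
// Hypothetical helper; assumes the delimiter never appears inside a field.
std::vector<std::string> split_fields(const std::string& line, char delim = ',') {
  std::vector<std::string> fields;
  std::istringstream in(line);
  std::string field;
  while (std::getline(in, field, delim)) {
    fields.push_back(field);
  }
  // std::getline yields nothing for an empty input or after a trailing
  // delimiter, so the final empty field is added by hand in those cases.
  if (line.empty() || line.back() == delim) {
    fields.emplace_back();
  }
  assert(!fields.empty());  // every line has at least one field
  return fields;
}
```

Applied to the first sample string from the question, this should yield nine fields, so the numeric ones can be converted with std::stoi(fields[4]) and so on, and the chain of find/substr calls collapses into a single call.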

prehistoricpenguin · Answer 1 · 2021-03-29T07:09:20.030

There are several options to consider.

Use a regex to parse the string; the code below is an example (you need GCC 4.9+ to compile it). Note that e-mail addresses are tricky to parse, whether with a hand-written parser or with a regex, so the code below only handles simplified cases. For good regex performance, it is recommended to replace std::regex with boost::regex or Google's RE2, since libstdc++'s regex implementation is known to be slow.

#include <iostream>
#include <regex>
#include <string>

struct student {
  std::string id;
  std::string firstName;
  std::string lastName;
  std::string eMail;
  int age = 0;
  int daysInCourse1 = 0;
  int daysInCourse2 = 0;
  int daysInCourse3 = 0;
  std::string degreeProgram;
};

std::ostream& operator<<(std::ostream& os, const student& st) {
  os << "["
     << "id:" << st.id << ",firstName:" << st.firstName
     << ",lastName:" << st.lastName << ",eMail:" << st.eMail
     << ",age:" << st.age << ",daysInCourse1:" << st.daysInCourse1
     << ",daysInCourse2:" << st.daysInCourse2
     << ",daysInCourse3:" << st.daysInCourse3
     << ",degreeProgram:" << st.degreeProgram << "]" << std::endl;
  return os;
}

int main(int argc, char* argv[]) {
  std::string data =
      "1,firstName,lastName,eMail@mail.com,18,1,2,3,degreeProgram";
  const std::regex kPattern(
      R"((\d+),(\w+),(\w+),((\w+)(\.|_)?(\w*)@(\w+)(\.(\w+))+),(\d+),(\d+),(\d+),(\d+),(\w+))");
  std::smatch base_match;
  student st;
  if (std::regex_match(data, base_match, kPattern)) {
    st.id = base_match[1];
    st.firstName = base_match[2];
    st.lastName = base_match[3];
    st.eMail = base_match[4];
    st.age = std::stoi(base_match[11]);
    st.daysInCourse1 = std::stoi(base_match[12]);
    st.daysInCourse2 = std::stoi(base_match[13]);
    st.daysInCourse3 = std::stoi(base_match[14]);
    st.degreeProgram = base_match[15];

    std::cout << st;
  }
  return 0;
}

To parse the mail part, it is also worth trying Boost.Tokenizer and Boost.Spirit 2.

If the content string itself is generated by your code, I suggest using some serialization/deserialization library to make your code easier to maintain and less error-prone. The serialization/deserialization part has nothing to do with our business logic, so we'd better use libraries or frameworks to help us:

You may consider using:

I know that what you've given are just mere examples, but I guess mentioning boost.tokenizer might also be good, especially for this particular case. Spirit is also worth mentioning imho, although it seems heavyweight for a simple "break on commas" task. — alagner, Mar 29 '21 at 06:59
@alagner Thanks for your suggestions, I have added the snippet. — prehistoricpenguin, Mar 29 '21 at 07:02

Basile Starynkevitch · Answer 2 · 2021-03-29T07:41:25.590

what is good practice for parsing through long strings in c++?

This is explained in books like the Dragon book, and parsing techniques are similar in C++, in C or in OCaml. You could also read books like Fowler's Domain Specific Languages, Scott's Programming Language Pragmatics, Pitrat's Artificial Beings: the conscience of a conscious machine (more speculative) and ACM SIGPLAN conference papers. Also read, of course, the Wikipedia pages on parsing, pushdown automata, and context-free grammars.

My suggestion is:

document in some written text (at least on paper) the syntax of acceptable inputs. You could use EBNF notation. Be aware that a set of examples does not define a syntax.
discuss and document what should be done by your software for unacceptable inputs.

Once you have specified (in writing) both points above, consider writing a recursive descent parser, or using a parser generator like ANTLR, or GNU bison, or something else (see this list).
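As a rough sketch of these suggestions applied to the question's data, the parser below writes an assumed grammar down in EBNF as a comment and then uses one function per grammar rule, in the style of a recursive descent parser (this particular grammar has no recursion, so neither does the code). The grammar, the reject-the-whole-record error policy, and the names `Student` and `RecordParser` are all assumptions made for illustration; the question does not specify them.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Assumed grammar (EBNF), written down before any code:
//   record = text, ",", text, ",", text, ",", text, ",",
//            number, ",", number, ",", number, ",", number, ",", text ;
//   text   = char, { char } ;      (* any character except "," *)
//   number = digit, { digit } ;
// Assumed policy for unacceptable input: reject the whole record.

struct Student {  // hypothetical record type
  std::string id, firstName, lastName, eMail, degreeProgram;
  int age = 0;
  int days[3] = {0, 0, 0};
};

class RecordParser {  // hypothetical name
 public:
  explicit RecordParser(const std::string& input) : in_(input) {}

  // Returns true and fills `out` only if the whole input matches `record`.
  bool parse(Student& out) {
    Student s;
    bool ok = text(s.id) && comma() && text(s.firstName) && comma() &&
              text(s.lastName) && comma() && text(s.eMail) && comma() &&
              number(s.age) && comma() && number(s.days[0]) && comma() &&
              number(s.days[1]) && comma() && number(s.days[2]) && comma() &&
              text(s.degreeProgram) && pos_ == in_.size();
    if (ok) out = s;
    return ok;
  }

 private:
  // Consume one "," if it is the next character.
  bool comma() {
    if (pos_ < in_.size() && in_[pos_] == ',') {
      ++pos_;
      return true;
    }
    return false;
  }

  // text: one or more characters up to the next "," or the end of input.
  bool text(std::string& out) {
    std::size_t start = pos_;
    while (pos_ < in_.size() && in_[pos_] != ',') ++pos_;
    out = in_.substr(start, pos_ - start);
    return !out.empty();
  }

  // number: one or more decimal digits.
  bool number(int& out) {
    std::size_t start = pos_;
    while (pos_ < in_.size() &&
           std::isdigit(static_cast<unsigned char>(in_[pos_]))) {
      ++pos_;
    }
    assert(pos_ <= in_.size());  // the cursor never runs past the input
    std::size_t len = pos_ - start;
    // Up to 9 digits always fit in an int, assuming int has at least 32 bits.
    if (len == 0 || len > 9) return false;
    out = std::stoi(in_.substr(start, len));
    return true;
  }

  std::string in_;       // private copy of the input
  std::size_t pos_ = 0;  // index of the next unread character
};
```

The parser reads each character once and never backtracks. A call such as `RecordParser(line).parse(student)` returns false for a record with a missing field or a non-numeric age, rather than letting std::stoi throw.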

Your documentation (of your parsed language) could be inspired by some specification of C++, like n3337 (or better), or this C++ reference, or some specification of C like n1570 (or better), or the definition of JSON or of YAML or of HTML or of CSV.

You might look, for inspiration, into the source code of existing open source C++ projects containing parsers (e.g. fish, Qt, RefPerSys, GCC, the Clang static analyzer etc...)

You probably want to avoid (or limit) backtracking in your parsing routines.

Be aware that in 2021 UTF-8 is used everywhere. Is

"A3,Basile,Starynkévitch,basile@starynkevitch.net,19,50,30,40,СТАРЫНКЕВИЧ",

some acceptable input (it contains the French é and, in Russian Cyrillic letters, СТАРЫНКЕВИЧ)? This should be documented! Parsing UTF-8 encoded text is not easy, but you could use GNU libunistring if allowed to.
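One reason this needs documenting: with UTF-8, std::string::size() counts bytes, not characters. The sketch below counts code points under the assumption that the input is already valid UTF-8. `count_code_points` is a hypothetical helper, and real validation or normalization would need a library such as the one mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode code points in a string assumed to hold valid UTF-8.
// Every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  assert(count <= utf8.size());  // never more code points than bytes
  return count;
}
```

For "Starynk\xC3\xA9vitch" (the é written as its two UTF-8 bytes), size() is 14 while the code point count is 13. Splitting on ',' byte by byte remains safe, because all bytes of a multi-byte UTF-8 sequence are 0x80 or above and so never equal an ASCII comma.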

Perhaps you want to use some database software, like SQLite or PostgreSQL. Both can be used (technically) from C++ code, and your example data looks like some database.

Exlife · Answer 3 · 2021-03-29T06:56:45.727

Since you know exactly how the input string is composed of fields, you can use strtok_s / strtok_r to fetch the tokens separated by separators (in your case, the separator is ","). You can store the tokens in an array of strings, or assign each token to its corresponding variable one by one.

Notice that:

(1) The input string must be writable (it cannot be read-only, because strtok_s/strtok_r modify the string itself). They also do not tell you which character terminated the token, and sometimes we are interested in what the separator character was.

(2) strtok_s/strtok_r skip empty tokens and return only non-empty ones. If the input string contains empty tokens (for example "t1,,,,t2" separated by commas), you get "t1" and "t2" and no empty strings, so the token indices will not be what you expect.

So if strtok's characteristics do not fulfill your requirements, you can implement your own version of the function.
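Point (2) can be seen in a short sketch. It uses std::strtok from <cstring> as a portable stand-in for strtok_s/strtok_r (the same token-skipping behaviour, but not reentrant), and compares it with a hand-written splitter that keeps empty tokens. Both function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Tokenize with std::strtok. It writes '\0' into its input, so the
// string is taken by value and only this private copy is modified.
std::vector<std::string> split_with_strtok(std::string copy, const char* delims) {
  std::vector<std::string> tokens;
  for (char* tok = std::strtok(&copy[0], delims); tok != nullptr;
       tok = std::strtok(nullptr, delims)) {
    tokens.emplace_back(tok);
  }
  return tokens;
}

// Hand-written alternative that keeps empty tokens and leaves the
// input untouched.
std::vector<std::string> split_keep_empty(const std::string& s, char delim) {
  std::vector<std::string> tokens;
  std::size_t start = 0;
  while (true) {
    std::size_t end = s.find(delim, start);
    if (end == std::string::npos) {
      tokens.push_back(s.substr(start));  // last token, possibly empty
      break;
    }
    tokens.push_back(s.substr(start, end - start));
    start = end + 1;
  }
  assert(!tokens.empty());  // even "" yields one (empty) token
  return tokens;
}
```

For "t1,,,,t2" the strtok version returns 2 tokens, while the hand-written one returns 5, three of them empty, so field positions stay aligned with the columns.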
