I am working on a project at university that attempts to create a JIT compiler in Python in C++. I am at the tokenisation step, and I managed to extract all strings and comments from the code. What I need is to divide the code into a flow of lexemes divided by Python operators (+, -, /, etc.) and separators (commas, semicolons and dots). It is essentially splitting the string but including the delimiters as well. From this question I thought about using regular expressions to capture all symbols that are either delimiters or are not. My only question is how do I specify a regular expression that:
- includes multiple characters (-=, //, !=);
- includes regex symbols like [, ], (, ), etc.
Thanks for any response in advance.
/// @brief Breaks the line down into a list of lexemes by
/// the delimiters preversing the delimiters themselves.
/// @param line The reference to the line to be tokenised.
/// @return List of lexemes ready to be parsed.
list<string> breakDown(string& line){
list<string> lexemes;
//const char expression[] = "[=-,;()[]]";
regex delimiters("(=|(|)|[|])|(=|(|)|[|])+)"); //This one doesn't work.
regex_iterator<string::iterator> it(line.begin(), line.end(), delimiters);
regex_iterator<string::iterator> end;
while (it != end) {
auto match = *it;
cout << "Match : " << match.str() << "\n";
string before = line.substring(0, match.position());
line = line.substring(match.position() + match.length());
lexemes.append(before);
lexemes.append(match.str());
it++;
}
return lexemes;
}