I am working on a university project that attempts to create a JIT compiler for Python in C++. I am at the tokenisation step, and I have managed to extract all strings and comments from the code. What I need now is to divide the code into a stream of lexemes separated by Python operators (+, -, /, etc.) and separators (commas, semicolons and dots). It is essentially splitting the string while keeping the delimiters as well. From this question I got the idea of using regular expressions to capture every run of symbols that either is a delimiter or is not. My only question is how to specify a regular expression that:

  • includes multiple characters (-=, //, !=);
  • includes regex symbols like [, ], (, ), etc.

Thanks in advance for any response.

/// @brief Breaks the line down into a list of lexemes by
/// the delimiters, preserving the delimiters themselves.
/// @param line The line to be tokenised.
/// @return List of lexemes ready to be parsed.
list<string> breakDown(const string& line){
    list<string> lexemes;
    //const char expression[] = "[=-,;()[]]";
    regex delimiters("(=|(|)|[|])|(=|(|)|[|])+)"); //This one doesn't work.
    sregex_iterator it(line.begin(), line.end(), delimiters);
    sregex_iterator end;
    size_t last = 0; // index just past the previous delimiter
    while (it != end) {
        const smatch& match = *it;
        cout << "Match : " << match.str() << "\n";
        size_t pos = match.position();
        // Text between the previous delimiter and this one (skipped if empty).
        if (pos > last) lexemes.push_back(line.substr(last, pos - last));
        lexemes.push_back(match.str());
        last = pos + match.length();
        ++it;
    }
    // Whatever follows the final delimiter.
    if (last < line.size()) lexemes.push_back(line.substr(last));
    return lexemes;
}
  • It looks like you are using `using namespace std;` [try to not do that](https://stackoverflow.com/questions/1452721/why-is-using-namespace-std-considered-bad-practice). – Pepijn Kramer Nov 01 '22 at 18:32
  • Ah yes, namespace std. This is a fully-fledged project that is a ready product, it isn't a library and it is not meant to be imported anywhere, which is the reason why I think it is good enough to use namespace std. – Владислав Король Nov 01 '22 at 18:40
  • @ВладиславКороль You might want to read that link anyway.... Your product is not safe from issues. – hyde Nov 01 '22 at 18:49
  • @hyde I appreciate your concern, however I don't see any way how this can backfire if you only use std and don't import it anywhere. – Владислав Король Nov 01 '22 at 18:50
  • @ВладиславКороль -- the problem is that `using namespace std;` will bite **you** when you least expect it. – Pete Becker Nov 01 '22 at 19:59

1 Answer

You need to escape every (, ), [ and ] that you want to match literally, because these characters are regex metacharacters. This site should help you test the pattern:

https://regex101.com/
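
As a rough illustration of the escaping advice, here is a minimal sketch that builds the pattern from a plain list of delimiters instead of writing it by hand. The helper names `escapeForRegex`, `buildDelimiterRegex` and `splitKeepingDelimiters` are hypothetical and not part of the question's code. The sketch assumes the default ECMAScript grammar of `std::regex`, where alternatives are tried left to right, so longer operators are sorted first to keep `-=` from being split into `-` and `=`. Whitespace is not treated specially.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: put a backslash in front of every regex
// metacharacter so that `text` is matched literally.
std::string escapeForRegex(const std::string& text) {
    static const std::regex special(R"([.^$|()\[\]{}*+?\\])");
    return std::regex_replace(text, special, R"(\$&)");
}

// Hypothetical helper: join the delimiters into one alternation,
// longest first, so multi-character operators win over their prefixes.
std::regex buildDelimiterRegex(std::vector<std::string> delimiters) {
    std::sort(delimiters.begin(), delimiters.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });
    std::string pattern;
    for (const std::string& d : delimiters) {
        if (!pattern.empty()) pattern += '|';
        pattern += escapeForRegex(d);
    }
    return std::regex(pattern);
}

// Split `line` on the delimiters and keep the delimiters as lexemes.
std::vector<std::string> splitKeepingDelimiters(const std::string& line,
                                                const std::regex& delimiters) {
    std::vector<std::string> lexemes;
    std::size_t last = 0;  // index just past the previous delimiter
    for (std::sregex_iterator it(line.begin(), line.end(), delimiters), end;
         it != end; ++it) {
        const std::size_t pos = static_cast<std::size_t>(it->position());
        if (pos > last) lexemes.push_back(line.substr(last, pos - last));
        lexemes.push_back(it->str());
        last = pos + static_cast<std::size_t>(it->length());
    }
    if (last < line.size()) lexemes.push_back(line.substr(last));
    return lexemes;
}
```

Called with a delimiter list such as `{"-", "=", "-=", "/", "//", "!=", "(", ")", "[", "]", ",", ";", "."}`, the input `a-=b[1]//(c)` should come back as the lexemes `a`, `-=`, `b`, `[`, `1`, `]`, `//`, `(`, `c`, `)`. The delimiter list is expected to be non-empty, since an empty pattern matches the empty string at every position.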

kmeh