Tokenize a String and Keep Delimiters Using Regular Expression in C++

Question

I would like to modify the given regular expression to produce the following list of matches. I am having a hard time describing the problem in words.

I want to use a regular expression to match a set of 'tokens'. Specifically, I want &&, ||, ;, ( and ) each to be matched as a token, and any run of text that contains none of those delimiters should also be a match. The problem I am having is distinguishing between one pipe and two pipes: a single | should stay part of the surrounding text, while || should be its own token. How can I produce the desired matches? Thank you a lot for your help!

The expression:

((&{2})|(\|{2})|(\()|(\))|(;)|[^&|;()]+)

Test String

a < b | c | d > e >> f && ((g) || h) ; i

Expected Matches

a < b | c | d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

Actual Matches

a < b 
|
 c 
|
 d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

I am trying to implement a custom tokenizer for a program in C++.

Example Code

std::vector<std::string> Parser::tokenizeInput(std::string s) {
    std::vector<std::string> returnTokens;

    //tokenize correctly using this regex
    std::regex rgx(R"S(((&{2})|(\|{2})|(\()|(\))|(;)|[^&|;()]+))S");

    std::regex_iterator<std::string::iterator> rit ( s.begin(), s.end(), rgx );
    std::regex_iterator<std::string::iterator> rend;

    while (rit!=rend) {

        std::string tokenStr = rit->str();

        if(tokenStr.size() > 0 && tokenStr != " "){
            //assure the token is not blank
            //and push the token
            boost::algorithm::trim(tokenStr);
            returnTokens.push_back(tokenStr);
        }

        ++rit;
    }

    return returnTokens;
}

Example Driver Code

//in main
std::vector<std::string> testVec = Parser::tokenizeInput(inputWithNoComments);
std::cout << "input string: " << inputWithNoComments << std::endl;
std::cout << "tokenized string[";
for(unsigned int i = 0; i < testVec.size(); i++){
    std::cout << testVec[i];
    if ( i + 1 < testVec.size() ) { std::cout << ", "; }
}
std::cout << "]" << std::endl;

Produced Output

input string: (cat file > outFile) || ( ls -l | grep -i )
tokenized string[(, cat file > outFile, ), ||, (, ls -l, grep -i, )]

input string: a && b || c > d >> e < f | g
tokenized string[a, &&, b, ||, c > d >> e < f, g]

input string: foo | bar || foo || bar | foo | bar
tokenized string[foo, bar, ||, foo, ||, bar, foo, bar]

What I Want the Output to be

input string: (cat file > outFile) || ( ls -l | grep -i )
tokenized string[(, cat file > outFile, ), ||, (, ls -l | grep -i, )]

input string: a && b || c > d >> e < f | g
tokenized string[a, &&, b, ||, c > d >> e < f | g]

input string: foo | bar || foo || bar | foo | bar
tokenized string[foo | bar, ||, foo, ||, bar | foo | bar]

Which programming language are you using? We can try to write a method to do this. Will be easy with Java String `split()`. — Kartik, Dec 05 '17 at 05:15

score 4 · Accepted Answer · answered Dec 05 '17 at 08:16

I suggest a splitting approach: pass {-1, 0} to std::sregex_token_iterator so that it collects both the non-matched substrings (-1) and the matched ones (0), use a much simpler regex such as &&|\|\||[;()], and discard the empty substrings (these arise from the way the string is split when consecutive matches are found):

std::regex rx(R"(&&|\|\||[();])");
std::string exp = "a < b | c | d > e >> f && ((g) || h) ; i";
std::sregex_token_iterator srti(exp.begin(), exp.end(), rx, {-1, 0});
std::vector<std::string> tokens;
std::remove_copy_if(srti, std::sregex_token_iterator(), 
                std::back_inserter(tokens),
                [](std::string const &s) { return s.empty(); });
for( auto & p : tokens ) std::cout <<"'"<< p <<"'"<< std::endl;

See the C++ demo, output:

'a < b | c | d > e >> f '
'&&'
' '
'('
'('
'g'
')'
' '
'||'
' h'
')'
' '
';'
' i'

Special credit for the empty string removal code goes to Jerry Coffin.
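
To get the trimmed tokens shown in the question's desired output, the same idea can be wrapped in a function that also trims each piece before checking whether it is empty. The following is a minimal sketch, not a definitive implementation: `tokenizeInput` here is a free function standing in for the question's `Parser::tokenizeInput`, and `trim` is a hypothetical helper used in place of `boost::algorithm::trim`.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: strip leading and trailing whitespace.
static std::string trim(const std::string& s) {
    auto notSpace = [](unsigned char c) { return !std::isspace(c); };
    auto first = std::find_if(s.begin(), s.end(), notSpace);
    auto last = std::find_if(s.rbegin(), s.rend(), notSpace).base();
    return first < last ? std::string(first, last) : std::string();
}

// Assumed free-function stand-in for the question's Parser::tokenizeInput.
std::vector<std::string> tokenizeInput(const std::string& s) {
    // Only the real delimiters are matched; a single '|' is not one of them.
    static const std::regex rx(R"(&&|\|\||[();])");
    std::vector<std::string> tokens;
    // -1 selects the text between matches, 0 selects the match itself.
    std::sregex_token_iterator it(s.begin(), s.end(), rx, {-1, 0});
    for (std::sregex_token_iterator end; it != end; ++it) {
        std::string tok = trim(it->str());
        if (!tok.empty()) {  // drop empty and whitespace-only pieces
            tokens.push_back(tok);
        }
    }
    return tokens;
}
```

Trimming before the emptiness check means whitespace-only pieces such as ' ' are dropped as well, so `tokenizeInput("foo | bar || foo")` should yield the three tokens `foo | bar`, `||` and `foo`.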

This is perfect, it produces exactly the output that I want and is simple and clean. I just revised my method to include this code. Thank you so much! — Brett K., Dec 05 '17 at 19:34

score 1 · Answer 2 · answered Dec 05 '17 at 06:07

You haven't specified which language you're using, but most app languages would support splitting a string on this regex:

" *((?=(&&|\|\||[;()]))|(?<=&&|\|\|)|(?<=[;()])) *"

The regex is a lookahead or lookbehind for your delimiters. Because lookarounds do not consume input, the delimiters are kept in the result array.

If you're using Python, things are much simpler; split on this regex:

" *(&&|\|\||[;()]) *"

Whatever part of the delimiter is captured becomes part of the output array.

Bohemian

Thanks for your answer. I'm using C++. I can't seem to get your first regex to work though. The challenge is that I need to keep the delimiters as well in the output array. Do you know if there is a split function like you described in C++? – Brett K. Dec 05 '17 at 06:21
@Brett try [this](https://stackoverflow.com/questions/236129/the-most-elegant-way-to-iterate-the-words-of-a-string) – Bohemian Dec 05 '17 at 07:00

Allan · Answer 3 · 2017-12-05T07:24:56.473

I have prepared the following regex and tested it; it produces exactly the output you described for your input string:

(?<=&&)[^;()]*|\(|\)|(?<=\|\|)[^;()]*|;|&&|\|\||([^|;()&]+(\|[^|;()&]+)*)*

or this one:

\(|\)|;|&&|\|\||([^|;()&]+(&[^|;()&]+|\|[^|;()&]+)*)

Let me know if it works as expected!

Matches:

a < b | c | d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

and tested on:

(cat file > outFile) || ( ls -l | grep -i )
(cat file >> outFile) && ls -l | grep -i
((file < file) || ls -l ; ls)
cat < InputFile | tr a-z A-Z | tee out1 > out2 >> out3 | asd aasdasd  | asd | asd || asd | asd
a | b || c | d && a || b && d ; g && 
a && b || c > d >> e < f | g
a < b | c | d > e >> f && ((g) || h) ; i
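
For anyone who wants to try the second pattern from C++, a small sketch with `std::sregex_iterator` might look like the following. The function name `matchTokens` is an assumption made for illustration. Note that matches keep their surrounding whitespace (so whitespace-only matches can appear and would still need trimming), and that a lone `|` or `&` at the very start of a segment is not matched by any alternative and is skipped.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper name; collects every match of the second pattern.
std::vector<std::string> matchTokens(const std::string& s) {
    // Delimiters come first in the alternation; the final group matches a
    // run of ordinary text that may contain single '&' or '|' characters.
    static const std::regex rx(
        R"(\(|\)|;|&&|\|\||([^|;()&]+(&[^|;()&]+|\|[^|;()&]+)*))");
    std::vector<std::string> out;
    for (std::sregex_iterator it(s.begin(), s.end(), rx), end; it != end; ++it) {
        out.push_back(it->str());  // untrimmed match text
    }
    return out;
}
```

On the input `foo | bar || foo` this should return `"foo | bar "`, `"||"` and `" foo"`, untrimmed. The first pattern uses lookbehind, which the default ECMAScript grammar of `std::regex` does not support, so only the second one is sketched here.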

edited Dec 05 '17 at 07:24

answered Dec 05 '17 at 06:42

Thanks for the answer, it almost works, but take a look at some of these test cases: https://regex101.com/r/fDf5VC/2 – Brett K. Dec 05 '17 at 06:54
`(?<=&&)[^;()]*|\(|\)|(?<=\|\|)[^;()]*|;|&&|\|\||([^|;()&]+(\|[^|;()&]+)*)*` that one should work fine for you! I have tested it and it looks great – Allan Dec 05 '17 at 07:15
or that one `\(|\)|;|&&|\|\||([^|;()&]+(&[^|;()&]+|\|[^|;()&]+)*)` – Allan Dec 05 '17 at 07:24
Thanks for your work on this! Take a look at this example https://regex101.com/r/fDf5VC/3 I would like `a < b | c | d > e >> f` to be one match instead of two – Brett K. Dec 05 '17 at 07:36
Ok I see, will have a look at it! – Allan Dec 05 '17 at 07:49
