
Desired behaviour:

  • Everything after a '#' is ignored (# = comment).
  • Empty lines don't create tokens.
  • '{' creates a token of type BLOCK_OPEN.
  • '}' creates a token of type BLOCK_CLOSE.
  • '=' creates a token of type EQUALS.
  • Everything else creates a token of type LABEL.
  • Tokens must not contain whitespace.

For most inputs, my tokenization function works flawlessly. Except for one bug:

show_position = { x=-9 y =78 }

Note the lack of spaces around some of the '=' signs!

The vector returned is missing the "=" between the "x" and the "-9".
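Concretely, here's what I expect versus what I actually get back for that line (shown as plain strings for illustration):

```cpp
#include <string>
#include <vector>

// Expected vs. actual tokens for: show_position = { x=-9 y =78 }
const std::vector<std::string> expected = {
    "show_position", "=", "{", "x", "=", "-9", "y", "=", "78", "}"
};
const std::vector<std::string> actual = {  // the "=" after "x" is gone
    "show_position", "=", "{", "x", "-9", "y", "=", "78", "}"
};
```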

How do I fix this bug? I tried debugging but couldn't figure out what I messed up. A fresh pair of eyes is a boon.


This is how I tokenize:

std::vector<Token> tokenizeLine(const std::string str)
{
    std::vector<Token> tokens;

    std::string::size_type start = 0;
    std::string::size_type end   = 0;
    while (end != std::string::npos)
    {
        enum POSES
        {
            EQUALS,
            OPEN,
            CLOSE,
            SPACE,
            EOL,
            RETURN,
            TAB,
            COMMENT,
            POSES_SIZE
        };
        std::string::size_type pos[] =
        {
            str.find('=', start),
            str.find('{', start),
            str.find('}', start),
            str.find(' ', start),
            str.find('\n', start),
            str.find('\r', start),
            str.find('\t', start),
            str.find('#', start)
        };
        end = *std::min_element(pos, &pos[POSES_SIZE]);

        switch (str[start])
        {
        case('=') :
            tokens.push_back(Token(Token::EQUALS, "="));
            break;
        case('{') :
            tokens.push_back(Token(Token::BLOCK_OPEN, "{"));
            break;
        case('}') :
            tokens.push_back(Token(Token::BLOCK_CLOSE, "}"));
            break;
        case(' ') :
        case('\n') :
        case('\r') :
        case('\t'):
            break;
        case('#') :
            return tokens;
            break;
        default:
            if(str.substr(start, end - start).length() > 0)
                tokens.push_back(Token(Token::LABEL, str.substr(start, end - start)));
        }

        // If at end, use start=maxSize.  Else use start=end+delimiter.
        start = ((end > (std::string::npos - sizeof(char)))
            ? std::string::npos : end + sizeof(char));
    }

    return tokens;
}

Here's a self-contained version (plain strings instead of Tokens) that you can run in the comfort of your own home:

#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> tokenizeLine(const std::string str)
{
    std::vector<std::string> tokens;

    std::string::size_type start = 0;
    std::string::size_type end   = 0;
    while (end != std::string::npos)
    {
        enum POSES // Deliminators
        {
            EQUALS,
            OPEN,
            CLOSE,
            SPACE,
            EOL,
            RETURN,
            TAB,
            COMMENT,
            POSES_SIZE
        };
        std::string::size_type pos[] =
        {
            str.find('=', start),
            str.find('{', start),
            str.find('}', start),
            str.find(' ', start),
            str.find('\n', start),
            str.find('\r', start),
            str.find('\t', start),
            str.find('#', start)
        };
        end = *std::min_element(pos, &pos[POSES_SIZE]);

        switch (str[start])
        {
        case('=') :
            tokens.push_back("=");
            break;
        case('{') :
            tokens.push_back("{");
            break;
        case('}') :
            tokens.push_back("}");
            break;
        case(' ') :
        case('\n') :
        case('\r') :
        case('\t'):
            break;
        case('#') :
            return tokens;
            break;
        default:
            if(str.substr(start, end - start).length() > 0)
                tokens.push_back(str.substr(start, end - start));
        }

        // If at end, use start=maxSize.  Else use start=end+delimiter.
        start = ((end > (std::string::npos - sizeof(char)))
            ? std::string::npos : end + sizeof(char));
    }
    return tokens;
}
Ivan Rubinson

2 Answers


This sounds like a job for a regex_iterator! For a regular token grammar like the one you're working with, it's hard to beat regexes. So rather than trying to wrangle your code into shape, throw it out and use the right tool for the job.

This regex has distinct captures for each of your desired tokens:

\s*(?:\n|(#[^\n]*)|(\{)|(\})|(=)|([^{}= \t\r\n]+))

Live Example

Given an input like const auto input = "#Comment\n\nshow_position = { x=-9 y =78 }"s, you could parse it as simply as:

vector<Token> tokens;

for_each(sregex_iterator(cbegin(input), cend(input), re), sregex_iterator(), [&](const auto& i) {
    if (i[1].length() > 0U) {
        tokens.emplace_back(Token::COMMENT, i[1]);
    } else if (i[2].length() > 0U) {
        tokens.emplace_back(Token::BLOCK_OPEN, "{"s);
    } else if (i[3].length() > 0U) {
        tokens.emplace_back(Token::BLOCK_CLOSE, "}"s);
    } else if (i[4].length() > 0U) {
        tokens.emplace_back(Token::EQUALS, "="s);
    } else if (i[5].length() > 0U) {
        tokens.emplace_back(Token::LABEL, i[5]);
    }
});

Live Example
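For completeness, here's a self-compiling sketch of the same approach that uses plain std::string tokens, since the Token class from the question isn't shown (it also drops comments entirely, as the asker's original code does):

```cpp
#include <regex>
#include <string>
#include <vector>

// Capture groups in the regex: 1 = comment, 2 = '{', 3 = '}',
// 4 = '=', 5 = label. Whitespace and bare newlines produce no token.
std::vector<std::string> tokenize(const std::string& input)
{
    static const std::regex re(
        R"(\s*(?:\n|(#[^\n]*)|(\{)|(\})|(=)|([^{}= \t\r\n]+)))");

    std::vector<std::string> tokens;
    for (auto it = std::sregex_iterator(input.cbegin(), input.cend(), re);
         it != std::sregex_iterator(); ++it)
    {
        const std::smatch& m = *it;
        if (m[2].matched)      tokens.push_back("{");
        else if (m[3].matched) tokens.push_back("}");
        else if (m[4].matched) tokens.push_back("=");
        else if (m[5].matched) tokens.push_back(m[5].str());
        // Group 1 (comments) deliberately produces no token.
    }
    return tokens;
}
```

Feeding it the problem line yields the full ten-token sequence, including the "=" between "x" and "-9".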

Jonathan Mee

TL;DR: To fix this, add --end after pushing the token inside the if in the default branch of your switch: IdeOne.

The problem is that whenever the token you find is a LABEL, you swallow one character more than you should. That's why the = right after x is ignored: when you put a whitespace between them, the whitespace is swallowed instead, and the = is parsed correctly.

The extra character gets swallowed for the following reason: at the end of every iteration you skip past the character at index end. For every other token type that's fine, because end is the index of the one-character token (or whitespace) you just consumed. But for a LABEL token, end is the index of the first character after the label, i.e. the very delimiter that should start the next token.
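Applied to the string-returning version from the question, the fixed function might look like this. Note the extra end != npos guard, which is my addition on top of the bare --end: it keeps a label at the very end of the input (where end is already npos) from decrementing npos and reading past the string:

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> tokenizeLine(const std::string& str)
{
    std::vector<std::string> tokens;

    std::string::size_type start = 0;
    std::string::size_type end   = 0;
    while (end != std::string::npos)
    {
        const std::string::size_type pos[] =
        {
            str.find('=', start),  str.find('{', start),
            str.find('}', start),  str.find(' ', start),
            str.find('\n', start), str.find('\r', start),
            str.find('\t', start), str.find('#', start)
        };
        end = *std::min_element(std::begin(pos), std::end(pos));

        switch (str[start])
        {
        case '=': tokens.push_back("="); break;
        case '{': tokens.push_back("{"); break;
        case '}': tokens.push_back("}"); break;
        case ' ': case '\n': case '\r': case '\t': break;
        case '#': return tokens;
        default:
            if (str.substr(start, end - start).length() > 0)
            {
                tokens.push_back(str.substr(start, end - start));
                if (end != std::string::npos)
                    --end;  // step back so the delimiter that ended this
                            // label is processed on the next iteration
            }
        }

        // If at end, stop; otherwise resume one past the delimiter.
        start = ((end > (std::string::npos - 1))
            ? std::string::npos : end + 1);
    }
    return tokens;
}
```

With that change, "x=-9" tokenizes as "x", "=", "-9" whether or not there are spaces around the "=".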

alexeykuzmin0