
Desired behaviour:

  • Everything after a '#' is ignored (# = comment).
  • Empty lines don't create tokens.
  • '{' creates a token of type BLOCK_OPEN.
  • '}' creates a token of type BLOCK_CLOSE.
  • '=' creates a token of type EQUALS.
  • Everything else creates a token of type LABEL.
  • Tokens must not contain whitespace.

For most inputs, my tokenization function works flawlessly. Except for one bug:

show_position = { x=-9 y =78 }

Note the lack of spaces around some of the '=' signs!

The vector returned is missing the "=" between the "x" and the "-9".
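Concretely, here's what I expect versus what I actually get back for that line (shown as plain strings for illustration):

```cpp
#include <string>
#include <vector>

// Expected vs. actual tokens for: show_position = { x=-9 y =78 }
const std::vector<std::string> expected = {
    "show_position", "=", "{", "x", "=", "-9", "y", "=", "78", "}"
};
const std::vector<std::string> actual = {  // the "=" after "x" is gone
    "show_position", "=", "{", "x", "-9", "y", "=", "78", "}"
};
```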

How do I fix this bug? I tried debugging but couldn't figure out what I messed up. A fresh pair of eyes is a boon.


This is how I tokenize:

std::vector<Token> tokenizeLine(const std::string str)
{
    std::vector<Token> tokens;

    std::string::size_type start = 0;
    std::string::size_type end   = 0;
    while (end != std::string::npos)
    {
        enum POSES
        {
            EQUALS,
            OPEN,
            CLOSE,
            SPACE,
            EOL,
            RETURN,
            TAB,
            COMMENT,
            POSES_SIZE
        };
        std::string::size_type pos[] =
        {
            str.find('=', start),
            str.find('{', start),
            str.find('}', start),
            str.find(' ', start),
            str.find('\n', start),
            str.find('\r', start),
            str.find('\t', start),
            str.find('#', start)
        };
        end = *std::min_element(pos, &pos[POSES_SIZE]);

        switch (str[start])
        {
        case('=') :
            tokens.push_back(Token(Token::EQUALS, "="));
            break;
        case('{') :
            tokens.push_back(Token(Token::BLOCK_OPEN, "{"));
            break;
        case('}') :
            tokens.push_back(Token(Token::BLOCK_CLOSE, "}"));
            break;
        case(' ') :
        case('\n') :
        case('\r') :
        case('\t'):
            break;
        case('#') :
            return tokens;
            break;
        default:
            if(str.substr(start, end - start).length() > 0)
                tokens.push_back(Token(Token::LABEL, str.substr(start, end - start)));
        }

        // If at end, use start=maxSize.  Else use start=end+delimiter.
        start = ((end > (std::string::npos - sizeof(char)))
            ? std::string::npos : end + sizeof(char));
    }

    return tokens;
}

Here's a self-contained version (plain strings instead of Tokens) that you can run in the comfort of your own home:

#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> tokenizeLine(const std::string str)
{
    std::vector<std::string> tokens;

    std::string::size_type start = 0;
    std::string::size_type end   = 0;
    while (end != std::string::npos)
    {
        enum POSES // Deliminators
        {
            EQUALS,
            OPEN,
            CLOSE,
            SPACE,
            EOL,
            RETURN,
            TAB,
            COMMENT,
            POSES_SIZE
        };
        std::string::size_type pos[] =
        {
            str.find('=', start),
            str.find('{', start),
            str.find('}', start),
            str.find(' ', start),
            str.find('\n', start),
            str.find('\r', start),
            str.find('\t', start),
            str.find('#', start)
        };
        end = *std::min_element(pos, &pos[POSES_SIZE]);

        switch (str[start])
        {
        case('=') :
            tokens.push_back("=");
            break;
        case('{') :
            tokens.push_back("{");
            break;
        case('}') :
            tokens.push_back("}");
            break;
        case(' ') :
        case('\n') :
        case('\r') :
        case('\t'):
            break;
        case('#') :
            return tokens;
            break;
        default:
            if(str.substr(start, end - start).length() > 0)
                tokens.push_back(str.substr(start, end - start));
        }

        // If at end, use start=maxSize.  Else use start=end+delimiter.
        start = ((end > (std::string::npos - sizeof(char)))
            ? std::string::npos : end + sizeof(char));
    }
    return tokens;
}
Ivan Rubinson

2 Answers


This sounds like a job for a regex_iterator! For a regular token grammar like the one you're working with, it's hard to beat regexes. So rather than trying to wrangle your code into shape, throw it out and use the right tool for the job.

This regex has distinct captures for each of your desired tokens:

\s*(?:\n|(#[^\n]*)|(\{)|(\})|(=)|([^{}= \t\r\n]+))

Live Example

Given an input like const auto input = "#Comment\n\nshow_position = { x=-9 y =78 }"s, you could parse it as simply as:

vector<Token> tokens;

for_each(sregex_iterator(cbegin(input), cend(input), re), sregex_iterator(), [&](const auto& i) {
    if (i[1].length() > 0U) {
        tokens.emplace_back(Token::COMMENT, i[1]);
    } else if (i[2].length() > 0U) {
        tokens.emplace_back(Token::BLOCK_OPEN, "{"s);
    } else if (i[3].length() > 0U) {
        tokens.emplace_back(Token::BLOCK_CLOSE, "}"s);
    } else if (i[4].length() > 0U) {
        tokens.emplace_back(Token::EQUALS, "="s);
    } else if (i[5].length() > 0U) {
        tokens.emplace_back(Token::LABEL, i[5]);
    }
});

Live Example
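For completeness, here's a self-compiling sketch of the same approach that uses plain std::string tokens, since the Token class from the question isn't shown (it also drops comments entirely, as the asker's original code does):

```cpp
#include <regex>
#include <string>
#include <vector>

// Capture groups in the regex: 1 = comment, 2 = '{', 3 = '}',
// 4 = '=', 5 = label. Whitespace and bare newlines produce no token.
std::vector<std::string> tokenize(const std::string& input)
{
    static const std::regex re(
        R"(\s*(?:\n|(#[^\n]*)|(\{)|(\})|(=)|([^{}= \t\r\n]+)))");

    std::vector<std::string> tokens;
    for (auto it = std::sregex_iterator(input.cbegin(), input.cend(), re);
         it != std::sregex_iterator(); ++it)
    {
        const std::smatch& m = *it;
        if (m[2].matched)      tokens.push_back("{");
        else if (m[3].matched) tokens.push_back("}");
        else if (m[4].matched) tokens.push_back("=");
        else if (m[5].matched) tokens.push_back(m[5].str());
        // Group 1 (comments) deliberately produces no token.
    }
    return tokens;
}
```

Feeding it the problem line yields the full ten-token sequence, including the "=" between "x" and "-9".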

Jonathan Mee

TL;DR: To fix this, add --end after pushing the token inside the if in the default branch of your switch: IdeOne.

The problem is that whenever the token you find is a LABEL, you swallow one character more than you should. That's why the = right after x is ignored: when you put a whitespace between them, the whitespace is swallowed instead, and the = is parsed correctly.

The extra character gets swallowed for the following reason: at the end of every iteration you skip past the character at index end. For every other token type that's fine, because end is the index of the one-character token (or whitespace) you just consumed. But for a LABEL token, end is the index of the first character after the label, i.e. the very delimiter that should start the next token.
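Applied to the string-returning version from the question, the fixed function might look like this. Note the extra end != npos guard, which is my addition on top of the bare --end: it keeps a label at the very end of the input (where end is already npos) from decrementing npos and reading past the string:

```cpp
#include <algorithm>
#include <string>
#include <vector>

std::vector<std::string> tokenizeLine(const std::string& str)
{
    std::vector<std::string> tokens;

    std::string::size_type start = 0;
    std::string::size_type end   = 0;
    while (end != std::string::npos)
    {
        const std::string::size_type pos[] =
        {
            str.find('=', start),  str.find('{', start),
            str.find('}', start),  str.find(' ', start),
            str.find('\n', start), str.find('\r', start),
            str.find('\t', start), str.find('#', start)
        };
        end = *std::min_element(std::begin(pos), std::end(pos));

        switch (str[start])
        {
        case '=': tokens.push_back("="); break;
        case '{': tokens.push_back("{"); break;
        case '}': tokens.push_back("}"); break;
        case ' ': case '\n': case '\r': case '\t': break;
        case '#': return tokens;
        default:
            if (str.substr(start, end - start).length() > 0)
            {
                tokens.push_back(str.substr(start, end - start));
                if (end != std::string::npos)
                    --end;  // step back so the delimiter that ended this
                            // label is processed on the next iteration
            }
        }

        // If at end, stop; otherwise resume one past the delimiter.
        start = ((end > (std::string::npos - 1))
            ? std::string::npos : end + 1);
    }
    return tokens;
}
```

With that change, "x=-9" tokenizes as "x", "=", "-9" whether or not there are spaces around the "=".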

alexeykuzmin0