Using stringstream to tokenize a string with different delimiters

Question

How can you use stringstream to tokenize a line that looks like this?

[label] opcode [arg1] [,arg2]

The label may not always be there, but if it isn't, the line starts with whitespace. The opcode is always present, and a space or tab separates the opcode from arg1. There is no whitespace between arg1 and arg2; they are separated by a comma.

Also, some blank lines contain only whitespace, so they need to be discarded. A '#' starts a comment.

So for instance:

#Sample Input
TOP  NoP
     L   2,1
VAL  INT  0

This is just an example of the text file I'll be reading from. For line one, the label would be TOP and the opcode would be NOP, with no arguments passed.

I've been working on it, but I need a simpler way to tokenize. From what I've seen, stringstream seems to be the tool I'd like to use, so if anyone can show me roughly how to do this, I'd really appreciate it.

I've been racking my brain on how to do this, and just to show that I'm not asking without having tried, here is my current code:

int counter = 0;
int i = 0;
int j = 0;
int p = 0;

while (getline(myFile, line, '\n'))
{


    if (line[0] == '#')
    {
        continue;
    }

    if (line.length() == 0)
    {
        continue;
    }

    if (line.empty())
    {
        continue;
    }

    // If the first letter isn't a tab or space then it's a label

    if (line[0] != '\t' && line[0] != ' ')
    {

        string delimeters = "\t ";

        int current;
        int next = -1;


        current = next + 1;
        next = line.find_first_of( delimeters, current);
        label = line.substr( current, next - current );

        Symtablelab[i] = label;
        Symtablepos[i] = counter;

        if(next>0)
        {
            current = next + 1;
            next = line.find_first_of(delimeters, current);
            opcode = line.substr(current, next - current);


            if (opcode != "WORDS" && opcode != "INT")
            {
                counter += 3;
            }

            if (opcode == "INT")
            {
                counter++;
            }

            if (next > 0)
            {
                delimeters = ", \n\t";
                current = next + 1;
                next = line.find_first_of(delimeters, current);
                arg1 = line.substr(current, next-current);

                if (opcode == "WORDS")
                {
                    counter += atoi(arg1.c_str());
                }
            }

            if (next > 0)
            {
                delimeters ="\n";
                current = next +1;
                next = line.find_first_of(delimeters,current);
                arg2 = line.substr(current, next-current);

            }
        }

        i++;

    }

    // If the first character is a tab or space then there is no label and we just need to get a counter
    if (line[0] == '\t' || line[0] == ' ')
    {
        string delimeters = "\t \n";
        int current;
        int next = -1;
        current = next + 1;
        next = line.find_first_of( delimeters, current);
        label = line.substr( current, next - current );

    if(next>=0)
        {
            current = next + 1;
            next = line.find_first_of(delimeters, current);
            opcode = line.substr(current, next - current);

            if (opcode == "\t" || opcode =="\n"|| opcode ==" ")
            {
                continue;
            }

            if (opcode != "WORDS" && opcode != "INT")
            {
                counter += 3;
            }

            if (opcode == "INT")
            {
                counter++;
            }


            if (next > 0)
            {
                delimeters = ", \n\t";
                current = next + 1;
                next = line.find_first_of(delimeters, current);
                arg1 = line.substr(current, next-current);

                if (opcode == "WORDS")
                {
                    counter += atoi(arg1.c_str());
                }

            }



            if (next > 0)
            {
                delimeters ="\n\t ";
                current = next +1;
                next = line.find_first_of(delimeters,current);
                arg2 = line.substr(current, next-current);

            }
        }

    }
}

myFile.clear();
myFile.seekg(0, ios::beg);

while(getline(myFile, line))
{
    if (line.empty())
    {
        continue;
    }

    if (line[0] == '#')
    {
        continue;
    }

    if (line.length() == 0)
    {
        continue;
    }



    // If the first letter isn't a tab or space then it's a label

    if (line[0] != '\t' && line[0] != ' ')
    {

        string delimeters = "\t ";

        int current;
        int next = -1;


        current = next + 1;
        next = line.find_first_of( delimeters, current);
        label = line.substr( current, next - current );


        if(next>0)
        {
            current = next + 1;
            next = line.find_first_of(delimeters, current);
            opcode = line.substr(current, next - current);



            if (next > 0)
            {
                delimeters = ", \n\t";
                current = next + 1;
                next = line.find_first_of(delimeters, current);
                arg1 = line.substr(current, next-current);

            }

            if (next > 0)
            {
                delimeters ="\n\t ";
                current = next +1;
                next = line.find_first_of(delimeters,current);
                arg2 = line.substr(current, next-current);

            }
        }

        if (opcode == "INT")
        {
            memory[p] = arg1;
            p++;
            continue;
        }

        if (opcode == "HALT" || opcode == "NOP" || opcode == "P_REGS")
        {
            memory[p] = opcode;
            p+=3;
            continue;
        }

        if(opcode == "J" || opcode =="JEQR" || opcode == "JNE" || opcode == "JNER" || opcode == "JLT" || opcode == "JLTR" || opcode == "JGT" || opcode == "JGTR" || opcode == "JLE" || opcode == "JLER" || opcode == "JGE" || opcode == "JGER" || opcode == "JR")
        {
            memory[p] = opcode;
            memory[p+1] = arg1;
            p+=3;
            continue;
        }

        if (opcode == "WORDS")
        {
            int l = atoi(arg1.c_str());
            for (int k = 0; k <= l; k++)
            {
                memory[p+k] = "0";
            }

            p+=l;
            continue;
        }

        else
        {
            memory[p] = opcode;
            memory[p+1] = arg1;
            memory[p+2] = arg2;
            p+=3;
        }

    }

    // If the first character is a tab or space then there is no label and we just need to get a counter        


    if (line[0] == '\t' || line[0] == ' ')
    {
        string delimeters = "\t ";
        int current;
        int next = -1;
        current = next + 1;
        next = line.find_first_of( delimeters, current);
        label = line.substr( current, next - current );

    if(next>=0)
        {
            current = next + 1;
            next = line.find_first_of(delimeters, current);
            opcode = line.substr(current, next - current);

            if (opcode == "\t" || opcode =="\n"|| opcode ==" "|| opcode == "")
            {
                continue;
            }



            if (next > 0)
            {
                delimeters = ", \n\t";
                current = next + 1;
                next = line.find_first_of(delimeters, current);
                arg1 = line.substr(current, next-current);

            }



            if (next > 0)
            {
                delimeters ="\n\t ";
                current = next +1;
                next = line.find_first_of(delimeters,current);
                arg2 = line.substr(current, next-current);

            }
        }

        if (opcode == "INT")
        {
            memory[p] = arg1;
            p++;
            continue;
        }

        if (opcode == "HALT" || opcode == "NOP" || opcode == "P_REGS")
        {
            memory[p] = opcode;
            p+=3;
            continue;
        }

        if(opcode == "J" || opcode =="JEQR" || opcode == "JNE" || opcode == "JNER" || opcode == "JLT" || opcode == "JLTR" || opcode == "JGT" || opcode == "JGTR" || opcode == "JLE" || opcode == "JLER" || opcode == "JGE" || opcode == "JGER" || opcode == "JR")
        {
            memory[p] = opcode;
            memory[p+1] = arg1;
            p+=3;
            continue;
        }

        if (opcode == "WORDS")
        {
            int l = atoi(arg1.c_str());
            for (int k = 0; k <= l; k++)
            {
                memory[p+k] = "0";
            }

            p+=l;

            continue;
        }

        else
        {
            memory[p] = opcode;
            memory[p+1] = arg1;
            memory[p+2] = arg2;
            p+=3;
        }
    }
}

I would obviously like to make this much much better so any help would be greatly appreciated.

If stringstream is really not enforced then I would recommend you use the reference from this answer http://stackoverflow.com/a/53863/1410711 — Recker, Sep 18 '12 at 00:21
Given input this complex, you almost certainly want to start thinking in terms of a lexer and possibly a parser. A couple of possibilities include Flex/byacc, or Boost Spirit/Qi (though there are definitely more). — Jerry Coffin, Sep 18 '12 at 00:38
Can I use string stream to accomplish this task? Boost tokenizer is something I can't use right now. — cadavid4j, Sep 18 '12 at 00:54

jrok · Accepted Answer · 2012-09-19T10:41:22.437

Before you go mad maintaining those huge if statements or trying to learn Boost Spirit, let's try to write a very simple parser. This is a bit of a long post and doesn't get directly to the point, so please bear with me.

First, we need a grammar, which seems to be dead simple:

    line
          label(optional)   opcode   argument-list(optional)

    argument-list
          argument
          argument, argument-list

In English: a line of code consists of an optional label, an opcode and an optional argument list. An argument list is either a single argument (an integer) or an argument followed by a separator (comma) and another argument list.

Let's first define two data structures. Labels are supposed to be unique (right?), so we'll keep a set of strings, which lets us look them up at any time and report an error if we find a duplicate label. The second is a map from strings to size_t, which acts as a symbol table of valid opcodes together with the expected number of arguments for each opcode.
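To make the recursive definition concrete, here is a minimal sketch of the argument-list rule written as a recursive function. The names `parse_argument` and `parse_argument_list` are hypothetical, and the sketch assumes arguments are plain decimal integers with no whitespace around the comma, as in the sample input.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// argument := optional '-' followed by one or more digits.
// Reads one argument starting at text[pos] and advances pos past it.
int parse_argument(const std::string& text, std::size_t& pos)
{
    std::size_t start = pos;
    if (pos < text.size() && text[pos] == '-') ++pos;
    std::size_t first_digit = pos;
    while (pos < text.size() &&
           std::isdigit(static_cast<unsigned char>(text[pos]))) {
        ++pos;
    }
    if (pos == first_digit)
        throw std::runtime_error("expected an integer argument");
    return std::stoi(text.substr(start, pos - start));
}

// argument-list := argument | argument ',' argument-list
void parse_argument_list(const std::string& text, std::size_t& pos,
                         std::vector<int>& out)
{
    out.push_back(parse_argument(text, pos));
    if (pos < text.size() && text[pos] == ',') {
        ++pos;                                 // consume the separator
        parse_argument_list(text, pos, out);   // the recursive half of the rule
    }
}
```

Calling `parse_argument_list("2,1", pos, out)` with `pos` starting at 0 fills `out` with 2 and 1. A loop would do the same job; the recursion is only there to mirror the grammar line by line.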

std::set<std::string> labels;
std::map<std::string, size_t> symbol_table = {
    { "INT", 1},
    { "NOP", 0},
    { "L",   2}
};

I don't know what exactly memory is in your code, but your way of calculating offsets to figure out where to put arguments seems unnecessarily complicated. Let's define a data structure that can elegantly hold a line of code instead. I'd do something like this:

typedef std::vector<int> arg_list;

struct code_line {
    code_line() : label(), opcode(), args() {}
    std::string  label;      // labels are optional, so an empty string
                             // will mean absence of label
    std::string  opcode;     // opcode, doh
    arg_list     args;       // variable number of arguments, it can be empty, too.
                             // It needs to match with opcode, we'll deal with
                             // that later
};

Syntax errors are exceptional circumstances that aren't easily recoverable, so let's deal with them by throwing exceptions. Our simple exception class can look like this:

struct syntax_error {
    syntax_error(std::string m) : msg(m) { }
    std::string msg;
};

Tokenizing, lexing and parsing are usually separate tasks. But I guess for this simple example, we can combine the tokenizer and lexer in one class. We already know the elements our grammar is made of, so let's write a class that takes input as text and extracts grammar elements from it. The interface could look like this:

class token_stream {
    std::istringstream stream; // stringstream for input
    std::string buffer;        // a buffer for a token, more on this later
public:
    token_stream(std::string str) : stream(str), buffer() { }

    // these methods are self-explanatory
    std::string get_label();
    std::string get_opcode();
    arg_list get_arglist();

    // we're taking a kind of top-down approach with this,
    // so let's forget about implementations for now
};

And the workhorse, a function that tries to make sense of the tokens and returns a code_line struct if everything goes fine:

code_line parse(std::string line)
{
    code_line temp;
    token_stream stream(line);

    // Again, self-explanatory, get a label, opcode and argument list from
    // token stream.

    temp.label = stream.get_label();
    temp.opcode = stream.get_opcode();
    temp.args = stream.get_arglist();

    // Everything went fine so far, remember we said we'd be throwing exceptions
    // in case of syntax errors.

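    // Now we can check if we got the correct number of arguments for the given opcode.
    // find() is used instead of operator[], which would silently insert an
    // unknown opcode into the table with an argument count of 0.

    auto entry = symbol_table.find(temp.opcode);
    if (entry == symbol_table.end()) {
        throw syntax_error("Unknown opcode.");
    }
    if (entry->second != temp.args.size()) {
        throw syntax_error("Wrong number of parameters.");
    }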

    // The last thing: if there's a label in the line, we insert it in the table.
    // We couldn't do that inside the get_label method, because at that time
    // we didn't yet know if the rest of the line was syntactically valid, and an
    // exception thrown would have left us with a "dangling" label in the table.

    if (!temp.label.empty()) labels.insert(temp.label);

    return temp;
}

And here's how we might use all this:

int main()
{
    std::string line;
    std::vector<code_line> code;

    while (std::getline(std::cin, line)) {

        // empty line or a comment, ignore it
        if (line.empty() || line[0] == '#') continue;

        try {
            code.push_back(parse(line));
        } catch (syntax_error& e) {
            std::cout << e.msg << '\n';

            // Give up, try again, log... up to you.
        }
    }
}

If the input was successfully parsed, we now have a vector of valid lines with all the info (labels, number of arguments) and can do pretty much anything we like with it. This code will be much easier to maintain and extend than yours, IMO. If you need to introduce a new opcode, for example, just make another entry in the map (symbol_table). How does that compare to your if statements? :)

The only thing left is the actual implementation of token_stream's methods. Here's how I did it for get_label:

std::string token_stream::get_label()
{
    std::string temp;

    // Unless the stream is empty (and it shouldn't be, we checked that in main),
    // operator>> for std::string is unlikely to fail. It doesn't hurt to be robust
    // with error checking, though

    if (!(stream >> temp)) throw syntax_error("Fatal error, empty line, bad stream?");

    // Ok, we got something. First we should check if the string consists of valid
    // characters - you probably don't want punctuation characters and such in a label.
    // I leave this part out for simplicity.

    // Since labels are optional, we need to check if the token is an opcode.
    // If that's the case, we return an empty (no) label.

    if (symbol_table.find(temp) != symbol_table.end()) {
        buffer = temp;
        return "";
    }

    // Note that above is where that `buffer` member of token_stream class got used.
    // If the token was an opcode, we needed to save it so get_opcode method can make
    // use of it. The other option would be to put the string back in the underlying 
    // stringstream, but that's more work and more code. This way, get_opcode needs   
    // to check if there's anything in buffer and use it, or otherwise extract from
    // the stringstream normally.

    // Check if the label was used before:

    if (labels.count(temp))
        throw syntax_error("Label already used.");

    return temp;
}

And that's it. I leave the rest of the implementation as an exercise for you. Hope it helped. :)
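For readers who want to check their work, here is one possible sketch of the two remaining methods. It is not the answer author's implementation, only a guess at what fits the interface above. It assumes arguments are integers written without whitespace around the comma (as in the sample input), and it repeats a shortened `get_label` without the duplicate-label check so that the block compiles on its own. The error messages are made up for illustration.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<int> arg_list;

struct syntax_error {
    syntax_error(std::string m) : msg(m) {}
    std::string msg;
};

// Opcode name -> expected number of arguments, as in the answer above.
std::map<std::string, std::size_t> symbol_table = {
    {"INT", 1}, {"NOP", 0}, {"L", 2}
};

class token_stream {
    std::istringstream stream;
    std::string buffer;  // an opcode that get_label read while looking for a label
public:
    token_stream(std::string str) : stream(str), buffer() {}
    std::string get_label();
    std::string get_opcode();
    arg_list get_arglist();
};

// Shortened version of the answer's get_label (no duplicate-label check).
std::string token_stream::get_label()
{
    std::string temp;
    if (!(stream >> temp)) throw syntax_error("Empty line.");
    if (symbol_table.find(temp) != symbol_table.end()) {
        buffer = temp;   // it was an opcode, keep it for get_opcode
        return "";
    }
    return temp;
}

std::string token_stream::get_opcode()
{
    std::string temp;
    if (!buffer.empty()) {
        temp.swap(buffer);               // take the buffered token, leaving buffer empty
    } else if (!(stream >> temp)) {
        throw syntax_error("Missing opcode.");
    }
    if (symbol_table.find(temp) == symbol_table.end())
        throw syntax_error("Unknown opcode: " + temp);
    return temp;
}

arg_list token_stream::get_arglist()
{
    arg_list args;
    std::string rest;
    if (!(stream >> rest)) return args;  // nothing follows the opcode

    std::string extra;
    if (stream >> extra) throw syntax_error("Unexpected token: " + extra);
    if (rest.back() == ',') throw syntax_error("Trailing comma.");

    // Split "2,1" on commas and convert each piece to an int.
    std::istringstream pieces(rest);
    std::string piece;
    while (std::getline(pieces, piece, ',')) {
        std::istringstream number(piece);
        int value;
        char junk;
        // The piece must be an integer with nothing left over after it.
        if (!(number >> value) || (number >> junk))
            throw syntax_error("Bad argument: " + piece);
        args.push_back(value);
    }
    return args;
}
```

Because `operator>>` skips leading whitespace, the indented line `     L   2,1` yields an empty label, opcode `L` and the arguments 2 and 1. One limitation to keep in mind: a label that happens to match an opcode name would be read as the opcode, and opcode matching is case-sensitive, so `NoP` would be taken for a label here.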

score 1 · Answer 2 · answered Sep 18 '12 at 00:23

You definitely need regular expressions such as boost regex; or lexical analysis and parsing tools such as lex/yacc, flex/bison or boost spirit for this version of the question.

It isn't worth the maintenance at this complexity to stay with strings and streams.

Jonathan Seng

I'm not sure how to use boost. Whenever I put `#include <boost/tokenizer.hpp>` I get this error: `boost/tokenizer.hpp: No such file or directory compilation terminated.` – cadavid4j Sep 18 '12 at 00:32
http://www.boost.org/doc/libs/1_51_0/more/getting_started/index.html -- Select the Getting Started guide for your platform. And of course, find the page for the boost library you want to use. – Jonathan Seng Sep 18 '12 at 00:33
If I try to compile this source code on another machine will it compile? If not, then boost is out of the question. – cadavid4j Sep 18 '12 at 00:35
You would need to have boost available in each environment you build in. But, any tool you use you will need to have on each platform you build on. Frankly, you are better off having boost since it makes things a lot easier in a lot of areas. With a little research, you could probably put boost (or whatever tool) in your source code distribution -- though don't bind it too closely. – Jonathan Seng Sep 18 '12 at 00:37
Maybe you can help me with this then: in my current code, how can I get rid of a line that only has whitespace in it? It works correctly except when a line contains nothing but tabs or spaces. – cadavid4j Sep 18 '12 at 01:00
`if(line.find_first_not_of(delimiters) != string::npos)` will tell you if there is anything but white space on the line. See [string::npos](http://en.cppreference.com/w/cpp/string/basic_string/find_first_not_of). This works by requesting the position of the first character not in delimiters. `string::npos` is returned for a failure to match. But, as the answer says, you definitely need something stronger or you will be writing an awful lot of code (and thus more bugs and more maintenance) more easily handled with other tools. – Jonathan Seng Sep 18 '12 at 01:21
Thank you very much. That's what I needed to make it work. Also, thank you for the advice on alternative ways to do it. I'll be making changes as I look into the boost library and whatnot. I appreciate it. – cadavid4j Sep 18 '12 at 01:50
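The `find_first_not_of` check from the comments above can be wrapped in a small helper. This is only a sketch: `is_blank` is a hypothetical name, and treating `'\r'` as whitespace is an added assumption, made so that files with Windows line endings are handled too.

```cpp
#include <string>

// Returns true when the line is empty or contains only spaces, tabs
// or a stray carriage return.
bool is_blank(const std::string& line)
{
    const std::string whitespace = " \t\r";
    // npos means no character outside `whitespace` was found.
    return line.find_first_not_of(whitespace) == std::string::npos;
}
```

Inside the `getline` loop, `if (is_blank(line)) continue;` would then replace the separate `line.empty()` and `line.length() == 0` checks.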
