I have run into a problem reading messages from a file in C++. The usual approach is to create a file stream and call getline() to fetch each message. getline() accepts an optional delimiter parameter, so it can return each "line" separated by that delimiter instead of the default '\n'. However, the delimiter has to be a single char. In my use case the delimiter in the messages may be something like "|--|", so I am looking for a solution that accepts a string as the delimiter instead of a char.

I searched Stack Overflow a bit and found some interesting posts. "Parse (split) a string in C++ using string delimiter (standard C++)" gives a solution that uses string::find() and string::substr() to parse with an arbitrary delimiter. However, all the solutions there assume the input is a string rather than a stream. In my case the file is too big to load into memory at once (and doing so would be wasteful), so it should be read message by message (or a batch of messages at a time).
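
For reference, the in-memory approach mentioned above looks roughly like the sketch below. The helper name split_all is made up for illustration, and it assumes the whole input already sits in a std::string, which is exactly the assumption that does not hold for a large file.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split an in-memory string on a multi-character delimiter
// using std::string::find() and std::string::substr().
std::vector<std::string> split_all(const std::string& text, const std::string& delim)
{
    std::vector<std::string> parts;
    if (delim.empty()) {                       // nothing to split on: return the input unchanged
        parts.push_back(text);
        return parts;
    }
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = text.find(delim, start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + delim.size();            // continue just past the delimiter
    }
    parts.push_back(text.substr(start));       // trailing piece (may be empty)
    return parts;
}
```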

Actually, reading through the gcc implementation of std::getline(), it seems much easier to handle the case where the delimiter is a single char: every time you load a chunk of characters, you can always search for the delimiter and split on it. It is different when the delimiter is more than one char: the delimiter itself may straddle two chunks, which causes many other corner cases.

I am not sure whether anyone else has faced this kind of requirement before, or how you handled it elegantly. It seems it would be nice to have a standard function like istream& getNext(istream&& is, string& str, string delim). This seems like a general use case to me. Why isn't something like this in the standard library, so that people no longer have to implement their own version separately?

Thank you very much

Yang Xu
  • getline with a string would require lookahead, so it could be slower in general. Just speculation. We'll need to implement our own custom getline. – AndyG Aug 01 '17 at 21:44
  • Is there an elegant implementation? As you mentioned, lookahead makes the code complicated. Maybe an FSM would be an elegant solution? – Yang Xu Aug 01 '17 at 21:52
  • The lookahead would be a simple FSM haha, just not as complicated as a regular expression. The gist of the program would be to read in characters until you reach the "delimiter" state and then parse those characters into a string. If you're just interested in a solution that "works", use a `std::vector` and play around. An "optimal" solution would be a little harder. If nobody's answered in a bit, I'll write something up. – AndyG Aug 01 '17 at 21:54
  • I would be tempted to `std::getline` to the first character of the delimiter string and buffer that read until the next read lets you test whether you had the delimiter or not. If you did, store the buffer; if not, append to the buffer and continue. – Galik Aug 01 '17 at 21:57
  • @Galik: I was thinking the same thing. – Remy Lebeau Aug 01 '17 at 22:12

3 Answers

The STL simply does not natively support what you are asking for. You will have to write your own function (or find a 3rd party function) that does what you need.

For instance, you can use std::getline() to read up to the first character of your delimiter, and then use std::istream::get() to read subsequent characters and compare them to the rest of your delimiter. For example:

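#include <cstddef>
#include <istream>
#include <stdexcept>
#include <string>

// note: delimiters whose first character repeats inside them in an overlapping
// way (e.g. "aab") would need full backtracking, which this function does not do
std::istream& my_getline(std::istream &input, std::string &str, const std::string &delim)
{
    if (delim.empty())
        throw std::invalid_argument("delim cannot be empty!");

    if (delim.size() == 1)
        return std::getline(input, str, delim[0]);

    str.clear();

    std::string temp;
    char ch;
    bool found = false;

    do
    {
        // read up to (and consume) the first character of the delimiter
        if (!std::getline(input, temp, delim[0]))
            break;

        str += temp;

        // getline() stopped at end of input, not at delim[0]: nothing more to match
        if (input.eof())
            return input;

        found = true;

        // compare the following characters with the rest of the delimiter
        for (std::size_t i = 1; i < delim.size(); ++i)
        {
            if (!input.get(ch))
            {
                // input ended inside a partial delimiter: keep the partial text
                // and report success with only eofbit set
                if (input.eof())
                    input.clear(std::ios_base::eofbit);

                str.append(delim.c_str(), i);
                return input;
            }

            if (delim[i] != ch)
            {
                // not the delimiter: keep the matched prefix and put the
                // mismatched character back, since it may start a new delimiter
                str.append(delim.c_str(), i);
                input.unget();
                found = false;
                break;
            }
        }
    }
    while (!found);

    return input;
}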
Remy Lebeau

If you are OK with reading byte by byte, you could build a state transition table implementation of a finite state machine to recognize your stop condition:

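#include <cstddef>
#include <string>
#include <vector>

std::string delimiter = "someString";
// initialize the table with a row per delimiter character, a column per possible byte value, and all zeros
std::vector<std::vector<int>> table(delimiter.size(), std::vector<int>(256, 0));
const int endState = static_cast<int>(delimiter.size());
// in state i, seeing delimiter[i] moves to state i + 1; any other byte falls back to state 0
for (std::size_t i = 0; i < delimiter.size(); ++i) {
    // cast through unsigned char so bytes above 127 do not become negative indices
    table[i][static_cast<unsigned char>(delimiter[i])] = static_cast<int>(i) + 1;
}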

Now you can use it like this:

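int currentState = 0;
int read = 0;
bool done = false;
// `in` stands for your std::istream; get() returns the next byte (0..255) or EOF
while (!done && (read = in.get()) != std::char_traits<char>::eof()) {
    currentState = table[currentState][read];
    if (currentState == endState) {
        done = true; // the whole delimiter has just been read
    }
    // do your streamy stuff
}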

Granted, this only works if the delimiter is in extended ASCII, but it will work fine for something like your example.

Austin_Anderson

It seems easiest to create something like getline(): read up to the last character of the separator, then check whether the string is long enough to hold the separator and, if so, whether it ends with the separator. If it does not, carry on reading:

#include <istream>
#include <iterator>
#include <string>

// note: `value` is not used; the message is returned by value
std::string getline(std::istream& in, std::string& value, std::string const& separator) {
    std::istreambuf_iterator<char> it(in), end;
    if (separator.empty()) { // empty separator -> return the entire stream
        return std::string(it, end);
    }
    std::string rc;
    char        last(separator.back());
    for (; it != end; ++it) {
        rc.push_back(*it);
        if (rc.back() == last
            && separator.size() <= rc.size()
            && rc.compare(rc.size() - separator.size(), separator.size(), separator) == 0) {
            ++it; // consume the separator's last character before returning
            rc.resize(rc.size() - separator.size()); // drop the separator from the result
            return rc;
        }
    }
    return rc; // no separator was found
}
Dietmar Kühl