
So I have to split the phrase "Hello, everyone! This is: COSC-1436, SP18" into separate tokens, discarding all punctuation except the dash. The output should be:

Hello

everyone

This

is

COSC-1436

SP18

I then have to encrypt each token, which I've got covered. I'm just having trouble using multiple delimiters. Here's what I have currently.

Function prototype: void tokenize(const string&, const string&, vector<string>&);

Function call: tokenize(code, " .,:;!?", tokens);

Function definition:

void tokenize(const string& str, const string& delim, vector<string>& tokens)
{
    int tokenStart = 0;

    int delimPos = str.find_first_of(delim);

    while(delimPos != string::npos)
    {
        string tok = str.substr(tokenStart, delimPos - tokenStart);

        tokens.push_back(tok);

        delimPos++;

        tokenStart = delimPos;

        delimPos = str.find_first_of(delim, delimPos);

        if(delimPos == string::npos)
        {
            string tok = str.substr(tokenStart, delimPos - tokenStart);

            tokens.push_back(tok);
        }   
    }
}

The only problem is that I now get empty tokens wherever the program encountered the punctuation marks. Any suggestions?

Tristan

2 Answers


Once you have found a delimiter, move the start of the next substring to the first character that is not a delimiter, using find_first_not_of. Basically change:

delimPos++;

to:

delimPos = str.find_first_not_of(delim, delimPos + 1);

This ensures that when two or more delimiters appear in a row, delimPos is moved past the last of them.
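For reference, here is one way the whole function might look with that idea applied. It is a minimal sketch rather than a drop-in patch: it keeps the signature from the question, but restructures the loop and uses std::size_t instead of int, so that an input ending in a delimiter (where find_first_not_of returns npos) does not reach substr with an out-of-range start.

```cpp
#include <string>
#include <vector>

// Sketch: the question's tokenize() reworked around find_first_not_of.
// std::size_t replaces int so comparisons against std::string::npos are exact.
void tokenize(const std::string& str, const std::string& delim,
              std::vector<std::string>& tokens)
{
    // Skip any leading delimiters to find the start of the first token.
    std::size_t tokenStart = str.find_first_not_of(delim);

    while (tokenStart != std::string::npos)
    {
        // The token ends at the next delimiter, or at npos for the last token.
        std::size_t delimPos = str.find_first_of(delim, tokenStart);

        // substr clamps the count, so delimPos == npos takes the rest of str.
        tokens.push_back(str.substr(tokenStart, delimPos - tokenStart));

        // Jump past the whole run of delimiters to the start of the next token.
        // If delimPos is npos, this returns npos and the loop ends.
        tokenStart = str.find_first_not_of(delim, delimPos);
    }
}
```

Called as tokenize(code, " .,:;!?", tokens) on the phrase from the question, this should fill tokens with the six words and no empty entries.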

Alternatively you can try this:

#include <iostream> 
#include <string>

int main()
{
    std::string str = "Hello, everyone! This is: COSC-1436, SP18";
    std::string const delims{ " .,:;!?" };

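    size_t beg, pos = 0;
    // Find the start of the next token, skipping any run of delimiters.
    while ((beg = str.find_first_not_of(delims, pos)) != std::string::npos)
    {
        // The token ends at the next delimiter, or at npos for the last token.
        pos = str.find_first_of(delims, beg + 1);
        std::cout << str.substr(beg, pos - beg) << std::endl;
    }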

    return 0;
}

https://ideone.com/LJota9

Hello
everyone
This
is
COSC-1436
SP18
Killzone Kid
  • Thanks so much for the help man! But I'm a bit of a coding newbie and I would like to implement what you said into the code I already have so I can understand it fully. What would I change in my code exactly to start my substring to the char which is first_not_of my delimiter? – Tristan Mar 10 '18 at 22:54
  • @Tristan highlighted what you need to change to get it working in my answer – Killzone Kid Mar 10 '18 at 23:10
  • Dude thank you so much, you're a legend! It's all working now! – Tristan Mar 11 '18 at 17:43

You can just use std::regex_iterator since that's exactly what it was designed for.

#include <regex>
#include <iostream>
#include <string>

int main()
{
    const std::string s = "Hello, everyone! This is: COSC-1436, SP18";

    std::regex words_regex("[^\\s.,:;!?]+");
    auto words_begin = std::sregex_iterator(s.begin(), s.end(), words_regex);
    auto words_end = std::sregex_iterator();

    for (std::sregex_iterator i = words_begin; i != words_end; ++i)
        std::cout << (*i).str() << '\n';
}

The output of that complete program will be this.

Hello
everyone
This
is
COSC-1436
SP18
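
Since the question stores the tokens in a vector rather than printing them, the same regex can be wrapped in a small helper. This is only a sketch: regexTokenize is a hypothetical name, not part of the question or of the standard library. Note that \s also matches tabs and newlines, so it is slightly broader than the single space in the question's delimiter string.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every regex match into a vector of tokens.
std::vector<std::string> regexTokenize(const std::string& s)
{
    // One or more characters that are neither whitespace nor . , : ; ! ?
    // static so the pattern is compiled only once across calls.
    static const std::regex wordsRegex("[^\\s.,:;!?]+");

    std::vector<std::string> tokens;
    // A default-constructed sregex_iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator it(s.begin(), s.end(), wordsRegex), end;
         it != end; ++it)
    {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

The dash is not in the excluded character set, so COSC-1436 stays a single token.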
Stephen M. Webb
  • That's pure C++11. – Stephen M. Webb Mar 09 '18 at 20:45
  • True, I somehow misread that you are using `std::regex` (not `std::basic_regex`). – SergeyA Mar 09 '18 at 20:48
  • Please note that regex is overkill for this simple case and also ~30 times slower than Killzone Kid's simple solution below (which matters for large files >10 MB) – Vit Oct 21 '19 at 14:20
  • "too much" is pretty subjective. I'd love to see your actual measurement data from a 10 MB file, given the cost of a regex is in compiling it and they tend to operate much faster and more efficiently (O(log n)) than brute-force linear comparisons (O(n^2)). – Stephen M. Webb Oct 21 '19 at 14:58