Split text with array of delimiters

Question

I want a function that split text by array of delimiters. I have a demo that works perfectly, but it is really really slow. Here is a example of parameters.

text:

"pop-pap-bab bob"

vector of delimiters:

"-"," "

the result:

"pop", "-", "pap", "-", "bab", "bob"

So the function loops throw the string and tries to find delimeters and if it finds one it pushes the text and the delimiter that was found to the result array, if the text only contains spaces or if it is empty then don't push the text.

std::string replace(std::string str,std::string old,std::string new_str){
    size_t pos = 0;
    while ((pos = str.find(old)) != std::string::npos) {
        str.replace(pos, old.length(), new_str);
    }
    return str;
}


std::vector<std::string> split_with_delimeter(std::string str,std::vector<std::string> delimeters){
    std::vector<std::string> result;
    std::string token;
    int flag = 0;
    for(int i=0;i<(int)str.size();i++){
        for(int j=0;j<(int)delimeters.size();j++){
            if(str.substr(i,delimeters.at(j).size()) == delimeters.at(j)){
                if(token != ""){
                    result.push_back(token);
                    token = "";
                }
                if(replace(delimeters.at(j)," ","") != ""){
                    result.push_back(delimeters.at(j));
                }
                i += delimeters.at(j).size()-1;
                flag = 1;
                break;
            }
        }
        if(flag == 0){token += str.at(i);}
        flag = 0;
    }
    if(token != ""){
        result.push_back(token);
    }
    return result;
}

My issue is that, the functions is really slow since it has 3 loops. I am wondering if anyone knows how to make the function faster. I am sorry, if I wasn't clear enough my english isn't the best.

[`std::find_first_of`](https://en.cppreference.com/w/cpp/algorithm/find_first_of). Its not automatically faster, but likely to make your code simpler. Do you really need `std::string` delimiters? In your example they are all single characters — 463035818_is_not_an_ai, Nov 15 '22 at 09:57
how slow is "really slow" ? For your example input I would not expect much difference even from a more efficient algorithm — 463035818_is_not_an_ai, Nov 15 '22 at 09:58
@463035818_is_not_a_number it has to be string because in my situation there are more then one character sometimes it is 5 or over. And when you have over 100,000 chracters that I have to loop through it takes over a minute to loop through — Cheese Danish, Nov 15 '22 at 10:00
not sure if I understand your code. You seem to construct substring to compare them with the delimiters, when you can simply call `std::string::find`. That alone might result in a speedup, because constructing substrings is expensive. Though before trying to optimize manually, did you turn on compiler optimizations? — 463035818_is_not_an_ai, Nov 15 '22 at 10:04
@463035818_is_not_a_number I am not sure what do you mean by "turn on compiler optimizations" and I am not sure how to implement the std::string::find in the function, could you please help me with that. I am really new to programming — Cheese Danish, Nov 15 '22 at 10:06
imho this is too broad. imho you should start with "How to split one string with a vector of delimiters?" or even simpler "How to split a string with a string delimiter?", because your problems start already there. https://stackoverflow.com/questions/14265581/parse-split-a-string-in-c-using-string-delimiter-standard-c — 463035818_is_not_an_ai, Nov 15 '22 at 10:09
concerning the desire to get the code run faster, you need to learn about how to turn on compiler optimizations. Before that it is pointless to try to modify the code aiming to get it faster — 463035818_is_not_an_ai, Nov 15 '22 at 10:10

score 0 · Answer 1 · answered Nov 15 '22 at 10:20

Maybe, as an alternative, you could use a regex? But maybe also too slow for you . . .

With a regex life would be very simple.

Please see the following example:

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <iterator>

const std::regex re(R"((\w+|[\- ]))");

int main() {
    
    std::string s{"pop-pap-bab bob"};
    
    std::vector<std::string> part{std::sregex_token_iterator(s.begin(),s.end(),re),{}};
    
    for (const std::string& p : part)   std::cout << p << '\n';
}

We use the std::sregex_token_iterator in combination with the std::vectors range constructor, to extract everything specified in the regex and then put all those stuff into the std::vector

The regex itself is also simple. It specifies words or delimiters.

Maybe its worth a try . . .

The program is much faster, but could you please tell how cant i turn vector of strings into regex because i am not sure how can I put my items in "std::regex re(R"((\w+|[\- ]))");". For example I have vector items = {"==","!=",">=","<=","<",">","||","&&",} — Cheese Danish, Nov 15 '22 at 10:29
@CheeseDanish you should create string that looks like this `"[(delim1)(delim2)...(delim100)]"` and use it as a regex — Deumaudit, Nov 15 '22 at 11:06

Deumaudit · Answer 2 · 2022-11-15T10:50:59.910

NOTE: You've complained that your code is slow, but it's important to understand that most of the answers will have options to potentially speed up the program. And even if the author of the option measured the acceleration of the program, the option may be slower on your machine, so do not forget to measure the execution speed yourself.

If I were you, I would create a separate function that receives an array of strings and outputs an array of delimited strings. The problem with this approach may be that if the delimiter includes another delimiter, the result may not be what you expect, but it will be easier to iterate through different options for string splitting, finding the best. And my solution would looks like this(though, it requires c++20)

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>
#include <vector>

std::vector<std::string> split_elems_of_array(const std::vector<std::string>& array, const std::string& delim)
{
    std::vector<std::string> result;
    for(const auto str: array)
    {
        for (const auto word : std::views::split(str, delim))
        {
            std::string chunk(word.begin(), word.end());
            if(!chunk.empty() && chunk != " ")
                result.push_back(chunk + delim);
        }
    }

    return result;
}

std::vector<std::string> split_string(std::string str, std::vector<std::string> delims)
{
    std::vector<std::string> result = {std::string(str)};
    for(const auto&delim: delims)
        result = split_elems_of_array(result, delim);
    return {result.begin(), result.end()};
}

For my machine, my approach is 56 times faster: 67 ms versus 5112 ms. Length of string is 1000000, there are 100 delims with length 100

Thank you but some reasone even though I am using c++20 it gives me error error: no member named 'views' in namespace 'std' for (const auto word: std::views::split(str, delim)){ — Cheese Danish, Nov 15 '22 at 11:00
@CheeseDanish what compiler are you using? What version(g++10.2 or smt else). — Deumaudit, Nov 15 '22 at 11:03
cmake_minimum_required(VERSION 3.24.2) project(Soy_Lang) set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/CMakeFiles/") file(GLOB SOURCE src/*.cpp) file(GLOB HEADER include/*.hpp) add_executable(Soy_Lang ${HEADER} ${SOURCE} ) set_source_files_properties(${SOURCE} PROPERTIES COMPILE_FLAGS "-std=c++20") include_directories(${SDL2_INCLUDE_DIRS} ${SDL2_IMAGE_INCLUDE_DIRS} ${SDL2_TTF_INCLUDE_DIRS}) my cmake file — Cheese Danish, Nov 15 '22 at 11:41
@CheeseDanish g++4.6.3 does not support ranges. If you want to compile exactly my solution, you need [g++12](https://godbolt.org/z/derdPcT48). You may modify `split_elems_of_array` in a way, that will not require ranges(examples of such is in the link in original answer), so you could compile code yourself — Deumaudit, Nov 15 '22 at 12:22

score 0 · Answer 3 · answered Nov 15 '22 at 11:42

Here is the algorithm of standard splitting. if you split pop-pap-bab bob by {'-' , ' '} it gives you ["pop", "pap", "bab", "bob"] it's not storing delimiters and doesn't check for empty text. You can change it to do those things too.

Define a vector of strings named result.
Define a string variable named buffer.
Loop over your string, if current character is not a delimiter append it to buffer.
if current character is a delimiter, append buffer to result.
Return result at the end.

std::vector<std::string> split(std::string str, std::vector<char> delimiters)
  {
    std::vector<std::string> result;
    std::string buffer;

    for (const auto ch : str)
    {
      if (std::find(delimiters.begin(), delimiters.end(), ch) == delimiters.end())
        buffer += ch;

      else
      {
        result.insert(result.end(), buffer);
        buffer.clear();
      }
    }

    if (buffer.length())
      result.insert(result.end(), buffer);

    return result;
  }

It's time complexity is O(n.m). n is the length of string and m is the length of delimiters.

Thank you but what if I want to keep the delimiter if it is found? — Cheese Danish, Nov 15 '22 at 11:44
You can store `std::find` result in a variable. `auto founded_iterator = std::find(delimiters.begin(), delimiters.end(), ch);` and then dereference it with `*` oprerator: `*founded_iterator`. — Ali Moameri, Nov 15 '22 at 11:58

shy45 · Accepted Answer · 2022-11-16T12:55:20.953

It might be a good idea to use boost expressive. It is a powerful tool for various string operations more than struggling with string::find_xx and self for-loop or regex.

Concise explanation: +as_xpr(" ") is repeated match more than 1 like regex and then prefix "-" means shortest match. If you define regex parser as sregex rex = "(" >> (+_w | +"_") >> ":" >> +_d >> ")", it would match (port_num:8080). In this case, ">>" means the concat of parsers and (+_w | +"_") means that it matches character or "_" repeatedly.

#include <vector>
#include <string>
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
using namespace std;
using namespace boost::xpressive;

int main() {
    string          source = "Nigeria is a multi&&national state in--habited by more than 2;;50 ethnic groups speak###ing 500 distinct languages";
    vector<string>  delimiters{ " ", "  ", "&&", "-", ";;", "###"};

    vector<sregex>  pss{ -+as_xpr(delimiters.front()) };
    for (const auto& d : delimiters) pss.push_back(pss.back() | -+as_xpr(d));

    vector<string>  ret;
    size_t          pos = 0;
    auto push       = [&](auto s, auto e) { ret.push_back(source.substr(s, e)); };
    for_each(sregex_iterator(source.begin(), source.end(), pss.back()), {}, [&](smatch const& m) {
            if (m.position() - pos) push(pos, m.position() - pos);
            pos = m.position() + m.str().size();
        }
    );
    push(pos, source.size() - pos);
    for (auto& s : ret) printf("%s\n", s.c_str());
}

Output is splitted by multiple string delimiers.

Nigeria
is
a
multi
national
state
in
habited
by
more
than
2
50
ethnic
groups
speak
ing
500
distinct
languages

Split text with array of delimiters

4 Answers4