Is there an alternative to the std::string substring?

Question

Given a string s = "RADILAMIA" I want to take all the substrings of length 4 (or something else).

If len == 4 then the substrings are: "RADI","ADIL","DILA","ILAM","LAMI","AMIA". It seems easy to do that by using the std::string substr method:

vector<string> allSubstr(string s,int len) {
    vector<string>ans;
    for(int i=0;i<=s.size()-len;i++) {
        ans.push_back(s.substr(i,len));
    }
    return ans;
}

substr's time complexity is unspecified, but generally linear against the length of the substring.

Can I do this without std::string substr. Any substring and the previous substring differ in only one letter. Is there any better way to reduce the time complexity?

What is the purpose of `temp` here? Also, why the condition if for loop is `s.size()-len`? Can you explain a bit more? Your question says substring of length `4` but you have calculated substrings of length `10`. — Shubham, Dec 13 '16 at 07:54
You pass `s` by value, so the only way you are going to be able to return a bunch of substrings is to copy them into `ans`. `s` will go out of scope. To make this faster, we'd need to see the code that calls this. If you made `s` a `const&`, then you could push pairs of string iterators into ans, and use them as ranges of the string. — doug65536, Dec 13 '16 at 08:03
Do you need actual strings? In that case, there isn't a better solution (complexity wise), because you always have to copy those 4 characters. An alternative would be only to store string views (start and length of each substring). Implementation wise you can tweak the code a bit - in particular `temp` is unnecessary. — MikeMB, Dec 13 '16 at 08:05
@Shubham The loop is `s.size() - len` because he wants it to end with `"AMIA"`, not `"AMIA", "MIA", "IA", "A", ""` — doug65536, Dec 13 '16 at 08:10
is it possible to run the program in O(N) time complexity where N is the string length? — Atif, Dec 13 '16 at 08:28
@MikeMB Yes i need actual substrings but i observe that the character difference of every consecutive substring only one because previous substring's first character will be removed and only one new character append in current substring. If i store the first substring like "RADI" then i can get next substring from first only remove 'R' and add 'L' in last.i think it's the sliding window technique.Can i implement this technique in this problem? — Atif, Dec 13 '16 at 08:41
@Atif You *assert* that you actually need copies of the input. What you probably need is a pair of iterators that increment across the input, where the second iterator is `len` characters ahead of the first iterator. Again, we would need to see the usage of this function's result to suggest a real improvement. dalle's answer is the closest you can get to the iterator pair trick without knowing what is going on in the caller. — doug65536, Dec 13 '16 at 08:44
@Alf if you need multiple independent strings it doesn't matter how they are related, because you always have to copy all 4 characters into the new string. All you can do is try to minimize the scaffolding around it. — MikeMB, Dec 13 '16 at 08:51
@Atif - "is it possible to run the program in O(N) time complexity where N is the string length? " The best you can hope is O(N*len) because you want the substrings in a vector and each copy of the substring will result in `len` characters being copied anyway. — Adrian Colomitchi, Dec 13 '16 at 09:01
@doug65536 in reality a pair of iterators will be as big, if not bigger than a copied string (at least on a 64 bit machine) — Richard Hodges, Dec 13 '16 at 09:02
@RichardHodges As big? Would a pair of iterators call malloc repeatedly? Would a pair of iterators copy the data redundantly into an allocated (vector) block holding bunch of separate allocated (string) blocks? — doug65536, Dec 13 '16 at 09:26
@doug65536 a pair of iterators on a 64-bit machine is 16 bytes. A string less that ~20 bytes long won't allocate any memory because of SSO (the string class re-purposes its internal memory when it detects that it is shorter than this threshold). Copying a small string in c++ is of the same order of complexity and timing as manipulating the iterators. It's why we should always favour simple code. The library implementors have already thought about our petty performance concerns. — Richard Hodges, Dec 13 '16 at 09:37
@Richard: Right. Although from the Question it is not clear if the strings in the real applications are also that short. Also the actual code the compiler has to optimize when working with strings is more complex (branchy) when working with strings. — MikeMB, Dec 13 '16 at 10:05
@MikeMB first point I accept. Second is not necessarily the case. an implementor could reasonably instruct the cpu to predict that short strings are the common case (since for long strings misprediction is less of an issue). In reality of course, `std::string` is a template, so the optimiser can see through all local operations. Any decision-based cost for non-mutating algorithms will be paid at most once per discrete code evaluation context. In this particular case, the cost need never be paid since the short string code path can be selected at compile time. — Richard Hodges, Dec 13 '16 at 10:15
@Richard: I didn't say the resulting binary code was more complex (although it was in my test). Just that the optimizer has to do more work, which might lead to different inlining decisions etc.. A view/pair of pointers is just a simple, trivially copyable and destructible datastructure and the pointers don't even get dereferenced in that function. It would be pretty hard for a SSO to beat that. — MikeMB, Dec 13 '16 at 18:56
@MikeMB understood. I accept that we can save a few instructions by using an iterator pair or string_view. Whether the extra logic complexity and dangling dependencies are worth it is another matter. — Richard Hodges, Dec 13 '16 at 19:03

Kadir Erdem Demir · Answer 1 · 2016-12-13T08:37:13.237

1

There can be millions of different approaches. Here is my algorithm.

vector<string> allSubstr(string s,int len) {

    vector<string>ans;
    ans.reserve(s.size() - len );
    for(size_t i=0;i<=s.size()-len;i++) 
    {
        ans.emplace_back( s.begin() +i, s.begin() + i + len );
    }

    return ans;
}

It is tested. I mean it wouldn't matter what you are using but emplace_back above can make a difference since there won't be copy cost. Also you add reserve for more performance.

edited Dec 13 '16 at 08:37

answered Dec 13 '16 at 08:06

Kadir Erdem Demir

3,531
3
28
39

1

Pardon me for asking, what is the time complexity of the `std::rotate`? – Adrian Colomitchi Dec 13 '16 at 08:09
1

And how is rotating the original string better than just copying from an offset? You are doing additional work, that gains you nothing – MikeMB Dec 13 '16 at 08:10
It is linear http://en.cppreference.com/w/cpp/algorithm/rotate. And it is even not needed here. I just used it for the beauty of it. "i" and "len" can be manipulated and it can be removed. – Kadir Erdem Demir Dec 13 '16 at 08:11
1

Using rotate makes it `O(n^2)` because it copies n things n times. – doug65536 Dec 13 '16 at 08:13
OK I were a bit excited about rotate . I removed it now. – Kadir Erdem Demir Dec 13 '16 at 08:33

user31264 · Answer 2 · 2016-12-13T08:47:04.730

1

No matter what you do, you still need O(NL) time to write all your substrings into the vector.

The fastest thing would be probably:

vector<string> ans(s.size()-len);
for(int i=0;i<=s.size()-len;i++) {
    ans[i] = s.substr(i, len);
}

Because push_back is slowish, and should generally be avoided if possible. It is overused.

PS: maybe this code would be even faster:

vector<string> ans(s.size()-len);
for(int i=0;i<=s.size()-len;i++) {
    ans[i].append(s.begin()+i, s.begin()+i+len);
}

edited Dec 13 '16 at 08:47

answered Dec 13 '16 at 08:13

user31264

6,557
3
26
40

1

`vector ans[s.size()-len];` VLAs aren't standard c++. – πάντα ῥεῖ Dec 13 '16 at 08:15
Fixed the brackets. – user31264 Dec 13 '16 at 08:16
@dalle - yes, fixed this too. I made 2 errors in 4 lines, shame on me. – user31264 Dec 13 '16 at 08:25
i observe that the character difference of every consecutive substring only one because previous substring's first character will be removed and only one new character append in current substring. If i store the first substring like "RADI" then i can get next substring from first only remove 'R' and add 'L' in last.i think it's the sliding window technique.Can i implement this technique in this problem? – Atif Dec 13 '16 at 08:44
@Atif - if you "only" remove R, such a removal takes linear time, so I doubt you gain anything. – user31264 Dec 13 '16 at 08:46

dalle · Answer 3 · 2016-12-13T08:18:30.583

1

string_view (C++17) has a constant time substr:

vector<string_view> allSubstr(const string_view& s, int len) {
    vector<string_view> ans;
    and.reserve(s.size() - len + 1);
    for (int i = 0 ; i <= s.size() - len; ++i) {
        ans.push_back(s.substr(i, len));
    }
    return ans;
}

Just make sure that s outlives the return value of the function.

edited Dec 13 '16 at 08:18

answered Dec 13 '16 at 08:16

dalle

18,057
5
57
81

`const string_view& s` has to become `string_view s`, Probably the safe and standard footprint would be : `vector & allSubstr( vector & result, string_view s, const size_t len)` – Chef Gladiator Aug 18 '20 at 07:46

score 0 · Answer 4 · answered Dec 13 '16 at 08:03

0

Probably you could use an array of chars instead. For example, you have got your word:

char s[] = "RADILAMIA";

To deal with all necessary substrings you can use such approach:

int substLength = 4;
int length = strlen(s);
char buffer[256];
for (int i = 0; i < length - substLength + 1; i++) {
    strncpy(buffer, s + i, substLength);
    buffer[substLength] = '\0';
    cout << buffer << endl;
}

Using the char array you easily can access to the start of any substring by adding the necessary index to the beginning of the array.

answered Dec 13 '16 at 08:03

Fomalhaut

8,590
8
51
95

2

That is not C++. That is C that uses `cout`. – doug65536 Dec 13 '16 at 08:05
2

What is the time complexity of the `strncpy` function again? – Adrian Colomitchi Dec 13 '16 at 08:07
1

This code works in C++ as well. You are not forbidden of the use of char arrays or `strncpy` in C++. I wrote this code because it works faster than a code using `std::string` object from C++. – Fomalhaut Dec 13 '16 at 08:09
Actually, you don't have to use `strncpy`, instead you can refer to the beginning of your substring within the given string. The complexity of `strncpy` is `O(m)` where `m` is the length of the destination. Read more: http://www.cplusplus.com/reference/cstring/strncpy/ – Fomalhaut Dec 13 '16 at 08:13
3

@Fomalhaut _"... because it works faster than a code using `std::string` object from C++"_ Sure about that? Did you measure? – πάντα ῥεῖ Dec 13 '16 at 08:14
You can measure it by yourself or do your own research about it. An example of research: http://stackoverflow.com/questions/21946447/how-much-performance-difference-when-using-string-vs-char-array – Fomalhaut Dec 13 '16 at 08:17
"The complexity of strncpy is O(m) where m is the length of the destination. " So your total complexity is O(N*m), right? How's this different from the solution in the OP with substrings? – Adrian Colomitchi Dec 13 '16 at 08:17
What is `N`? I was asked about the complexity of `strncpy`. – Fomalhaut Dec 13 '16 at 08:18
"What is N?" The length of the original string. – Adrian Colomitchi Dec 13 '16 at 08:19
@Fomalhaut Sorry, but you should post **your** implementation of the same code, but using `std::string`. Don't link to other code that does something entirely different. – PaulMcKenzie Dec 13 '16 at 08:20
OP substrings would work slower. But if you are satisfied with their speed, you can use them of course, because they are more convenient to use in other pieces of code. – Fomalhaut Dec 13 '16 at 08:21
Quite frankly, I'd also expect this to be faster (ignoring the `std::cout` and although `strncopy` does more work than needed here). In particular it saves dynamic memory allocations from the `std::vector` (depending on the size also from `std::string`) and doesn't have any temporaries (which the compiler might or might not get rid of in the OP's exmple). **But** you don't return a array of strings, as the OP does, so it isn't really a solution to do the same thing faster. – MikeMB Dec 13 '16 at 08:26

Adrian Colomitchi · Answer 5 · 2016-12-13T09:03:44.297

It pays to revisit the docos

// string proto(len);
vector<string> result(s.size()-len, string(len, char(32))); // preallocates the buffers

const char *str=s.c_str();
const char* end=str+s.size()-len;

for(size_t i=0; str<end; str++, i++) {
  result[i].assign(str, len); // likely to result in a simple copy in the preallocated buffer
}

The complexity is the same O(len*s.size()) - one can only hope for a smaller proportionality factor.

Chef Gladiator · Answer 6 · 2020-08-17T21:21:15.063

C is not always faster than C++ but @Fomalhaut was right to post the performant core solution in C. Here is my (C program) complete version, based on his algorithm. Without using strncpy, too.

Here it is on the godbolt.

#ifdef __STDC_ALLOC_LIB__
#define __STDC_WANT_LIB_EXT2__ 1
#else
#define _POSIX_C_SOURCE 200809L
#endif

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>
#include <malloc.h>

//////////////////////////////////////////////////////////////
// array of buffers == a_of_b
typedef struct a_of_b  {
    const unsigned size;
    unsigned count ;
    char ** data   ;
} a_of_b ;

a_of_b a_of_b_make ( const unsigned size_ ) 
{
   return (a_of_b){ .size = size_, .count = 0, .data = calloc(1, sizeof(char * [size_] ) ) } ;
}

a_of_b * a_of_b_append ( a_of_b * self,  const unsigned len_, const char str_[len_] ) 
{
    assert( self->data ) ;
    assert( self->size > self->count ) ;
    self->data[ self->count ] = strndup( str_, len_ ) ;
    self->count += 1;
    return self ;
}

a_of_b * a_of_b_print ( a_of_b * self , const char * fmt_ ) 
{
    for (unsigned j = 0; j < self->count; ++j)
         printf( fmt_ , self->data[j]);
    return self ;
}

a_of_b * a_of_b_free ( a_of_b * self  ) 
{
    for (unsigned j = 0; j < self->count; ++j)
         free( self->data[j]) ;
    free( self->data) ;
    self->count = 0  ;         
    return self ;
}
//////////////////////////////////////////////////////////////
a_of_b breakit ( const unsigned len_, const char input_[len_], const unsigned  substLength )
{
    assert( len_ > 2 ) ;
    assert( substLength > 0 ) ;
    assert( substLength < len_ ) ;

    const unsigned count_of_buffers = len_ - substLength + 1;

    a_of_b rez_ = a_of_b_make( count_of_buffers +1 ) ;

    for (int i = 0; i < count_of_buffers ; i++) {
        a_of_b_append( &rez_, substLength, input_ + i ) ;
    }

   return rez_ ;
}
//////////////////////////////////////////////////////////////
static void driver( const char * input_, const unsigned substLength ) 
{
    printf("\n");
    a_of_b substrings = breakit( strlen(input_), input_, substLength );
    a_of_b_print( & substrings , "%s ");
    a_of_b_free( & substrings);
}
//////////////////////////////////////////////////////////////
int main () { 

    driver( "RADILAMIA", 4) ;
    driver( "RADILAMIA", 3) ;
    driver( "RADILAMIA", 2) ;
    driver( "RADILAMIA", 1) ;
    
    return EXIT_SUCCESS; 
}

And the program output is:

RADI ADIL DILA ILAM LAMI AMIA 

RAD ADI DIL ILA LAM AMI MIA 

RA AD DI IL LA AM MI IA 

R A D I L A M I A

Enjoy.

Is there an alternative to the std::string substring?

6 Answers6