24

The generic algorithm for deducing whether a string contains all unique characters (without using any other data structures) is to go through the string, checking each letter against the rest of the string for a match. This approach is O(n^2).

The approach below (written in C) uses an offset for the inner iteration over the string, since, for example, there is no reason to test the last character against the first character: the first character already did that comparison.

My question is this: Is the runtime of the algorithm then O(n!) or something like O(n log n)?

#include <stdio.h>

int strunique(const char *str)
{
    size_t offset = 1;
    char *scout = (char *)str, *start;

    /* scout walks the string; start scans only the characters
       after scout, skipping pairs that were already compared */
    for (; *scout != '\0'; ++scout, ++offset)
        for (start = (char *)str + offset; *start != '\0'; ++start)
            if (*start == *scout)
                return 0;   /* duplicate found */

    return 1;               /* all characters unique */
}

int main(void)
{
    printf("%d\n", strunique("uniq"));
    printf("%d\n", strunique("repatee"));

    return 0;
}
quickerque
  • 243
  • 2
  • 6
  • Why not have an array of size 2^`CHAR_BIT` and use it as a counter, stopping as soon as any counter exceeds 1? – Kerrek SB May 21 '15 at 00:38
  • 2
    What does "O(n + 1/2n)" mean? Do you know what "O" means? – Kerrek SB May 21 '15 at 00:39
  • 6
    If you have two horses and want to know which runs faster, you should try asking this question on [whichhorseisfaster.stackexchange.com](http://ericlippert.com/2012/12/17/performance-rant/)... – nhgrif May 21 '15 at 00:39
  • 6
    Your algorithm takes about half as long as the naive algorithm. As the difference is a constant factor, your algorithm is still O(n*n). – jxh May 21 '15 at 00:44
  • 1
    Runtime complexity has no direct relation to how long it will take an algorithm to execute. – Collin Dauphinee May 21 '15 at 00:48
  • @CollinDauphinee: You are correct. Pedantically, the proposed algorithm performs about half as many comparisons as the naive algorithm. – jxh May 21 '15 at 00:51
  • You're saving some time, but I'm afraid it's still n^2. You can add a counter to tell you exactly how many iterations there are for each sample. If it's n or 2n, please share. – Tim3880 May 21 '15 at 00:53
  • Why are you asking about `n!` now? That'd be way worse than `n^2`. – sepp2k May 21 '15 at 01:27
  • @Tim3880: no you can use the pigeonhole argument. If the outer `for` loop has iterated *i* times, we know there are at most *255-i* different characters in the remaining string. From the moment we have visited the 256-th character, we know there is a duplicate. Every string of length `256` has at least one duplicated char. – Willem Van Onsem May 21 '15 at 01:34
  • @jxh: that's true, if you don't take the pigeonhole argument into account. If you do, you can see the outer loop has at most `255` tries before the algorithm will return something. – Willem Van Onsem May 21 '15 at 01:56
  • 1
    OP: The answer you accepted specifically refers to parts of your question - don't just edit them out. – Barry May 21 '15 at 01:56
  • @CommuSoft: While it is interesting to think about how the practical bounds of the problem space can lead to a technically better big-O, it doesn't actually lead to a more efficient computation of the result. You really need a different algorithm that actually reduces the number of operations. So, using a bit-vector of 2^`CHAR_BIT` bits to track which bytes have been observed solves the problem in a single pass, but it also makes it easy to prove you will not have any more than 2^`CHAR_BIT` +1 iterations before you stop. – jxh May 21 '15 at 01:58
  • Right, this algorithm's order-of-execution complexity is inherently limited by the radix (256) of the datatype it is operating over. This limitation dominates the more usual one implied by the control and data structures (O(n^2)). So instead of being O(n^2), it is actually O(r^2), where r is the radix (256), which, because it is a constant for this problem, means it is really O(k), which is also O(1). – RBarryYoung May 21 '15 at 01:58
  • @jxh: you don't need to use a bit-vector: you can stop looping after *|C|* iterations (both inner/outer loop) with *|C|* the number of possible characters. So it runs in *O(|C|^2)*. But the size of *|C|* is fixed (even if you would use a different char-set, like Unicode, one can assume it is fixed at compile time), so no more than 65k instructions, which is 0.0000006 second-ish. – Willem Van Onsem May 21 '15 at 02:11
  • @CommuSoft: I am not arguing that the alphabet bounds the number of iterations, but your approach still ends up visiting parts of the string more than once. If I need to save CPU cycles, your argument about O(1) doesn't matter. – jxh May 21 '15 at 02:12
  • @jxh: yes, but the parts visited can be limited to the first |C| indices as well. And the approach combines both this algorithm and the `256` bound. Except for extra testing (checking both the index bound and the string-termination bound), that's the only source of additional cycles. So for small strings, it runs approximately the same as the algorithm above. – Willem Van Onsem May 21 '15 at 02:14
  • 2
    @CommuSoft: I am glad you acknowledge that the single pass algorithm is better. Now, consider that this is a theoretical exercise. The practical problem will have `char` replaced by `std::string`, or worse, `Foobar`. – jxh May 21 '15 at 02:20
  • @jxh: indeed. As stated before: if the number of possible values is fixed, it cascades automatically to *O(n)*, and you don't have to perform any bound checks (or know the number of elements in advance). If you know the number of elements, you can make it O(1). And if the number of elements is infinite, it is O(n^2). – Willem Van Onsem May 21 '15 at 02:22
  • @CommuSoft: Again, even assuming I know that no string will be longer than 4 billion bytes, it doesn't help the actual problem of reducing the number of CPU cycles knowing the multi-pass implementation is still O(1). – jxh May 21 '15 at 02:27
  • @jxh: I think you confuse the **length of the string** with the **size of the alphabet** (which is what *the number of elements* implied). Say you have a machine with an infinite amount of memory and a string of length `10^1000` in your program. If you know your string can contain at most `4` different characters, this algorithm will stop in `4*10^1000` steps (not `10^2000`), and a smarter algorithm can stop in `16` steps. The *number of elements* referred to the size of the *alphabet* (`256` for ASCII, which is very reasonable), not the *size of the string*. – Willem Van Onsem May 21 '15 at 02:33
  • 2
    @CommuSoft: I am talking about the extended practical problem of identifying duplicate strings in an array of strings. – jxh May 21 '15 at 02:34
  • @jxh: in that case if you assume there are no bounds on the length of strings, it is *O(n^2)*. Otherwise it is an *impractical* O(1). So you're correct on that point. But for `char`s that's not the case. If it was to compare arrays of `T` with `T` a generic parameter, *O(n^2)* is indeed the answer. – Willem Van Onsem May 21 '15 at 02:36
  • @jxh: and I hope you agree, this *extended* practical problem is not the scope of this question. – Willem Van Onsem May 21 '15 at 03:12
  • 1
    @CommuSoft: Who's to say? – jxh May 21 '15 at 03:25
  • @jxh: that's of course a problematic aspect to judge. But (a) `char`s in C/C++ are finite (even if we assume the machine has infinite memory and pointers can point every place in that infinite memory); (b) in theoretical computer science, nearly all alphabets on strings are considered finite as well. So if we interpret this question very strictly this is not the extended problem. If we drop all assumptions one can make on characters, it is indeed the extended version. – Willem Van Onsem May 21 '15 at 03:30
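For reference, here is a minimal sketch of the counter-table idea raised in the comments (Kerrek SB, jxh). Note it uses an auxiliary table, which the question's "no other data structures" constraint forbids; the function name is illustrative, and 8-bit characters are assumed:

#include <limits.h>

/* One counter per possible char value; stop as soon as any
   counter exceeds 1. A single pass: for 8-bit chars at most 256
   iterations can happen before a duplicate is guaranteed. */
int strunique_counts(const char *str)
{
    unsigned counts[1 << CHAR_BIT] = { 0 };

    for (; *str != '\0'; ++str)
        if (++counts[(unsigned char)*str] > 1)
            return 0;   /* some character occurred twice */

    return 1;           /* all characters unique */
}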

5 Answers

22

In addition to other answers, I'd like to point out that the problem can be solved in O(1) without additional memory, and without modifying the contents of the input string.

First, do `strnlen(str, 256)`. If you get more than 255, return 0. By the pigeonhole principle, some character must occur more than once. This operation takes only O(1) since we examine only a bounded prefix of the string.

Now, the string is shorter than a constant (256), so use any naive algorithm to finish in only O(1) additional time.
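A minimal sketch of this answer's approach (the function name is mine; `strnlen` is POSIX, see the comments below for a `memchr` alternative):

#include <string.h>

int strunique_bounded(const char *str)
{
    size_t i, j;

    /* More than 255 non-NUL characters: by the pigeonhole
       principle some character must repeat. O(1) work. */
    if (strnlen(str, 256) > 255)
        return 0;

    /* The string is now known to be shorter than 256 characters,
       so the naive quadratic scan is O(1) overall. */
    for (i = 0; str[i] != '\0'; ++i)
        for (j = i + 1; str[j] != '\0'; ++j)
            if (str[i] == str[j])
                return 0;

    return 1;
}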

Atsby
  • 2,277
  • 12
  • 14
  • Upvoted. I noted the same in my answer at roughly the same time as you. You can do better than O(n), since you can write your own strlen() that stops once it gets to 256: you don't need to know the precise length, only that it's > 255. – Paul Hankin May 21 '15 at 01:10
  • 1
    @Barry: in case there are only 256 *characters*, and by using a bit vector of length `256`, you can, because of the pigeonhole argument. You know that if there is a duplicate character, it occurs at most after `256` positions: indeed, you can fill the first positions with different chars, but eventually you run out of unique chars, and so the duplicate occurs. From that moment, you can stop. – Willem Van Onsem May 21 '15 at 01:16
  • 1
    @CommuSoft Correct up to the fact that '\0' is not a valid character in a C string, so you only have 255 different characters. – Atsby May 21 '15 at 01:18
  • @Atsby: well, that makes the argument only stronger. And it of course depends a bit on whether you follow the "standard" that `\0` is a terminating character. But even if you see `NUL` as a possible character (and for instance use an encoding with an explicit length, like arrays do), the reasoning remains valid. – Willem Van Onsem May 21 '15 at 01:19
  • 2
    Instead of writing your own strlen, use `char *p = memchr(str, 0, 256); if (p == NULL) notunique = true; else len = p - str;` ... – rcgldr May 21 '15 at 03:08
  • @rcgldr Fair point, given that `strnlen` is a GNU-ism. Hadn't realized that. **EDIT:** Actually, it is in POSIX, and marked CX "extension to the ISO C standard" so I would guess fairly portable. – Atsby May 21 '15 at 03:27
  • @Atsby - strnlen() and memchr() should be similar code, but in the case of Microsoft compilers, memchr() is optimized with clever code to check 4 bytes at a time. – rcgldr May 21 '15 at 04:14
  • 2
    @rcgldr Nice, but who cares what Microsoft compilers do? – Atsby May 21 '15 at 04:16
  • @Atsby - I'm wondering if other compilers also have the same issue, an optimized memchr() versus a non optimized strnlen(). – rcgldr May 21 '15 at 04:19
  • @rcgldr On linux, gcc 4.8.2, strnlen seems to run at about 0.95 the speed of memchr. So they are either equally good or equally bad. – Atsby May 21 '15 at 04:36
  • @rcgldr Looking at the disassembly of libc.so.6 on linux, it appears that they both use mmx, so I would guess they are both equally good. – Atsby May 21 '15 at 04:42
19

No, it's still O(n^2). You just slightly improved the constant. You still have two nested loops; the naive count-the-loops way of estimating big-O time should tell you this.

Also, there is no such thing as O(n + 1/2n). Big-O notation is there to give you an idea of the order of magnitude something should take. n + (1/2)n = 1.5n, and since big O drops all constant factors, that is just O(n).

You can beat O(n^2) without extra memory, though. If nothing else, you can sort the string's characters by ASCII value (O(n log n) time) and then walk the array looking for adjacent duplicates (O(n) time), for O(n + n log n) = O(n log n) total; see the sketch below. There are probably other tricks as well.

Note that the sorting approach may not give a better runtime in practice, though: the naive way has a best-case runtime of O(1) (the first two characters may already match), while a sort-first algorithm always has to sort, so it has a best case of O(n log n). So the best big-O time may not be the best choice.
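A minimal sketch of the sort-then-scan idea (names are illustrative). Note that it must reorder the buffer, so the caller has to pass a writable string, and the C standard does not guarantee `qsort`'s complexity, though implementations are typically O(n log n):

#include <stdlib.h>
#include <string.h>

static int cmp_char(const void *a, const void *b)
{
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

int strunique_sorted(char *str)
{
    size_t i, n = strlen(str);

    qsort(str, n, 1, cmp_char);     /* typically O(n log n) */

    /* After sorting, any duplicates are adjacent: O(n) scan. */
    for (i = 1; i < n; ++i)
        if (str[i] == str[i - 1])
            return 0;

    return 1;
}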

Alfe
  • 56,346
  • 20
  • 107
  • 159
Gabe Sechan
  • 90,003
  • 9
  • 87
  • 127
  • 1
    Your last remarks are incorrect. The best case for the naive algorithm is still O(n) and the best case for sorting is O(n) too. O(n log n) is only the worst (and expected) case lower bound, but a sorting algorithm *can* take O(n) time to sort *some* inputs (e.g. insertion sort or tim sort). Moreover, you should mention that not all sorting algorithms can achieve O(n log n) performance with no extra memory. For example quicksort fails in this regard (requires O(log n) memory for the call stack), while heapsort works. – Bakuriu May 21 '15 at 14:20
  • 1
    @Bakuriu the naïve approach has a best case runtime of 1 because if the first two characters match, the length of the string has no impact on the runtime. – IronMensan May 21 '15 at 14:58
  • @IronMensan Sorry, that's true. However the part that troubles me is the references to sorting complexities. – Bakuriu May 21 '15 at 16:33
11

Short answer

If the number of possible characters (not to be confused with the length of the string) is not fixed (not the case here), the time complexity of your algorithm is O(n^2). If we assume there is only a fixed number of valid characters (here 255, or 4G for 32-bit characters), your algorithm runs in worst-case O(n). Under that assumption, the algorithm can easily be improved to run in O(1).

Note on asymptotic behavior and big-oh: these are theoretical results. That an algorithm runs in O(1) does not mean it runs in reasonable time; it means it runs in constant time. So, asymptotically speaking, it makes no difference whether you enter a string of length 10^1000 or one of length 10^10'000 (given these lengths are large enough). The time it takes can still be more than one hundred times the age of the universe.

Explanation

You can do a simple worst-case analysis of the for loops:

for (; *scout != '\0'; ++scout, ++offset)
    for (start = (char *)str + offset; *start != '\0'; ++start)
        //single statement

Now we want to know how many times the single statement (it stands for a fixed number of statements) will be executed. Since you never modify the content of the string, we know there is an index n at which the value is \0; n is the length of the string.

So we can rewrite it as:

for (scout = 0, offset = 0; scout < n; ++scout, ++offset)
    for (start = offset; start < n; ++start)
        //single statement

(I've assumed the string starts at memory address 0; since that's only a shift, this is allowed and it only makes the reasoning simpler.)

Now we're going to calculate the number of steps of the inner for loop, parameterized by the offset o. That's equal to:

T_inner(o) = n - o

With o the offset and n the length of the string.

Now we can use this formula to calculate the total number of steps over all iterations of the outer for loop. Here o starts at 0 and runs up to (but excluding) n. So the total number of steps is:

T_total = sum_{o=0}^{n-1} (n - o) = (n^2 + n) / 2

Which is O(n^2).

But now one has to ask: is it possible to construct such a worst-case string? The answer is no! There are only 255 valid characters (the NUL character is not considered a character); the bound above only holds if we cannot make this assumption. Say the first character is an a (with a an arbitrary char). Then either it matches another a in the string, which is detected in O(n) time (loop through the remainder of the string), or all other characters differ from a. In the first case, the algorithm terminates in O(n); in the second case, it means the second character differs from the first.

Let's say the second character is b. Then we again iterate over the string in O(n), and if we find another b we terminate, after 2n or O(n) steps. If not, we try to find a match for the next character, c.

The point is that we only need to do this at most 255 times, because there are only 255 valid characters. As a result, the time complexity is 255n, or O(n).
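To observe this bound empirically (as Tim3880's comment above suggests), one can instrument the question's function with iteration counters. This variant is only illustrative, not part of the original question:

#include <stddef.h>

/* Instrumented copy of the question's strunique: counts outer and
   inner iterations. For 8-bit characters, *outer never exceeds 256
   before the function returns, regardless of the string's length. */
int strunique_counted(const char *str, size_t *outer, size_t *inner)
{
    size_t offset = 1;
    const char *scout = str, *start;

    *outer = *inner = 0;
    for (; *scout != '\0'; ++scout, ++offset) {
        ++*outer;
        for (start = str + offset; *start != '\0'; ++start) {
            ++*inner;
            if (*start == *scout)
                return 0;
        }
    }
    return 1;
}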

Alternative explanation

Another variant of this explanation: if the outer for loop is looking at the i-th character, we know that all characters to the left of i are different from that character and from each other (otherwise we would have rejected earlier). Now since there are only 255 valid characters, by the 256-th character of the string there cannot be a new, different character anymore: there are simply no such characters left.

Example

Say you have an alphabet with 3 characters (a, b, and c); this is only to make the matter easier to understand. Now say we have a string:

scout
v
b a a a a a b
  ^
  start

It is clear that your algorithm will use O(n) steps here: start simply iterates over the entire string once, reaches the final b, and returns.

Now say there is no duplicate of b in the string. In that case the algorithm does not stop after iterating over the string once. But this implies all the other characters differ from b (after all, we've iterated over the string and didn't find a duplicate). So now consider a string with that condition:

scout
v
b a a a a a a
  ^
  start

Now it is clear that a first attempt to find a character b in the remainder of the string will fail. Now your algorithm increments the scout:

  scout
  v
b a a a a a a
    ^
    start

And starts searching for a duplicate of a. Here we're very lucky: the very first character inspected is an a. But in general, if there is a duplicate a, it costs at most two passes over the string, so O(2n), to find it.

Now we're reaching the boundary case: there is no duplicate a either. In that case, we know the string must begin with

    scout
    v
b a c ...

We furthermore know that the remainder of the string can contain neither b's nor a's, because otherwise the scout would never have moved that far. The only remaining possibility is that the remainder of the string contains c's. So the string reads:

    scout
    v
b a c c c c c c c c c c
      ^
      start

This means that after iterating over the string at most 3 times, we will find such a duplicate, regardless of the size of the string or how the characters are distributed within it.

Modify this to O(1) time

You can easily modify this algorithm to let it run in O(1) time: simply place additional bounds on the indices:

int strunique(const char *str)
{
    size_t offset = 1;
    char *scout = (char *)str, *start, *stop = scout + 255;

    /* scout visits at most the first 255 characters,
       start at most the first 256 */
    for (; scout < stop && *scout != '\0'; ++scout, ++offset)
        for (start = (char *)str + offset; start <= stop && *start != '\0'; ++start)
            if (*start == *scout)
                return 0;

    return 1;
}

In this case we've bounded the first loop such that it visits at most the first 255 characters, and the inner loop visits at most the first 256 (note the <= instead of <). So the total number of steps is bounded by 255 x 256, which is O(1). The explanation above already demonstrates why this is sufficient.

Note: if the character type were 32-bit, you would need to replace 255 by 4'294'967'296, which makes the algorithm indeed theoretically O(n), but practically O(n^2), in the sense that the constant factor in front of the n is so huge that an O(n^2) algorithm will outperform it on any realistic input.

Since we combine the string-termination check with the 256 bound, this algorithm will nearly always run faster than the one proposed above. The only source of potential extra cycles is the additional testing in the modified for loops, but since these tests lead to earlier termination, in many cases they won't cost additional time.

Big-oh

One can object: "Yes, that's true for strings of length 256 or more, but what about strings shorter than 256?" The point is that big-oh analysis deals with asymptotic behavior. Even if the behavior were super-exponential for strings up to a certain length, those lengths are not taken into account.

To emphasize the (problematic) aspect of asymptotic behavior further: one can say that the following algorithm is correct, asymptotically speaking:

int strunique(const char *str) {
    return 0;
}

It always returns false, because there is a length n0 such that for every input length n > n0 this algorithm returns the correct result (every such string contains a duplicate). This has not much to do with big-oh itself; it is more to illustrate that one must be careful with saying an algorithm running in O(1) will outperform an algorithm running in O(n^6) for any reasonable input. Sometimes the constant factor can be gigantic.

Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
  • Commenting to learn, not to critique. I always thought `O(1)` means constant, as in the same for every input. If you know the bounds, even then your best and worst cases are different. I tend to think it is still `O(n^2)`. What if the bounds were `2550000000 x 256000000`? – Shahzeb May 21 '15 at 01:46
  • 3
    I think most people forget an important aspect: big-oh means asymptotic behavior: *O(f(n))* means "there exist an input length N0 and a value *a* such that for every input X with |X| >= N0, the number of steps T is at most *a f(|X|)*". And yes, even `2.5G x 2.5G` is *O(1)*. But here the bounds are more reasonable. – Willem Van Onsem May 21 '15 at 01:52
  • 1
    @Shahzeb What's your n? CommuSoft's algorithm is O(1) in the length of the input string. Which is the metric you usually use. It's not O(1) in the size of the alphabet - but he makes the explicit disclaimer that he considers the alphabet fixed. A fixed alphabet is not an uncommon or unreasonable assumption to make. The algorithm is O(m^2), where m is the size of the alphabet. – Taemyr May 21 '15 at 07:24
  • Your answer is cute but the OP asked if his/her algorithm is O(n). I think it is still O(n^2). I don't see where he is checking boundaries, can you? We are not talking about the best runtime of the best solution, but the runtime of this very algorithm he presented. Yes, he/she can do better, but not without further work. – Tim3880 May 21 '15 at 12:33
  • @Tim3880: as said before, you don't have to. Let's take ASCII as the alphabet; it contains `255` valid characters (`\0` is not a valid one). Now if the `scout` gets as far as, say, the `4`-th character, we know the first `3` occurred only once in the string; otherwise we would have returned `0` earlier. So the string reads something like `abcde...`. And if we reach the `255`-th character and the string is longer, we know it will return `0`: since the first characters were all different and there are only `255` characters, there must be at least one duplicate. – Willem Van Onsem May 21 '15 at 12:47
  • The algorithm solves this implicitly by returning `0` from the moment it finds a duplicate. Although it is not an *intelligent* algorithm, it benefits from the pigeonhole argument as a side effect. – Willem Van Onsem May 21 '15 at 12:48
5

Your algorithm is O(N^2). This is easy to verify by simply noting that in the worst case, a string with all unique characters, every character must be checked against every character that comes after it. That is, worst case, N*(N-1)/2 = O(N^2) comparisons.

Note that by definition:

f(x) = O(g(x))

if there exists some constant M such that |f(x)| <= M|g(x)| for all sufficiently large x. So when you say the runtime f(n) = O(n + 1/2n) (which is incorrect for your algorithm), it follows that:

  f(n) = O(n + 1/2n)
  f(n) <= M * (n + 1/2n) for some M, n_0, for all n >= n_0, by definition
  f(n) <= (3/2 * M) * n, n >= n_0
  f(n) <= M' * n, setting M' = 3/2 * M, n >= n_0
  f(n) = O(n), by definition

That is, constants always drop out, which is why you might hear the expression that constants don't matter (at least as far as calculating runtime complexity goes; obviously they matter for actual performance).

Barry
  • 286,269
  • 29
  • 621
  • 977
  • 3
    If you do an analysis without using any properties of strings, you're right. But the problem is: you cannot construct such a string. Every string with length greater than `255` has at least one duplicate character. It means that after at most `255` attempts in the outer for-loop, you will have found that character. So after *O(255n)=O(n)*, your algorithm will return `0`. For strings less than `255`, we're talking about a constant. – Willem Van Onsem May 21 '15 at 01:49
  • @CommuSoft That's really not the salient feature of the question. – Barry May 21 '15 at 01:54
  • 1
    Well, if the *alphabet* is fixed, then the algorithm is bounded by *O(n)*, even if you don't know the size of the alphabet beforehand. Despite the algorithm's *stupidity*, it will always do at most `|A|` iterations of the outer for loop, with |A| the number of possible characters. – Willem Van Onsem May 21 '15 at 01:58
  • If it starts the *i*-th iteration, it knows all the previous characters are different from the current one. After |A| iterations, you've run out of different characters that can be placed before the current one without *duplicating*. – Willem Van Onsem May 21 '15 at 02:01
  • @CommuSoft Yes, you can make the argument that literally any algorithm is `O(1)` because there is some maximum number of possible comparisons across the universe of all possible strings. However, that doesn't really answer OP's question, or address OP's misconception of what big-Oh notation means - so no, I do not believe that's really relevant to *answering the question*. – Barry May 21 '15 at 02:02
  • no, I'm not using *that* argument. Let me make it more practical. Say you have a string of length *n* and there are at most `3` possible characters. You will make at most *3n* steps. The worst-case string is `abc...c`. I first read the first character *a* and start iterating over the string, thus *O(n)* steps, and don't find an *a*. Now I iterate for the second char (`b`), again no luck, another *O(n)* wasted. But now `c` is in scope, and since the characters on the right cannot be `a`'s or `b`'s (those would have been filtered out earlier), I know there will be `c`'s on the right side. – Willem Van Onsem May 21 '15 at 02:05
  • As a challenge: can you provide a single string where the **`scout`** needs more than `255` steps before termination? The `start` must indeed perform *O(n)* iterations. It is impossible because of the pigeonhole argument, even if you had `65k` chars (although that would blow up the constant factor of the *O(n)*). The nice thing is that the algorithm is not aware of this, but still runs faster. – Willem Van Onsem May 21 '15 at 02:07
  • I'm not interested in upvotes at all. I'm only saying that, given the charset is fixed (at compile time), the algorithm runs in O(A*N) with A the number of characters (fixed) and N the length of the string. Please provide a string to disprove my theory. You can make the charset smaller to make it more reasonable. – Willem Van Onsem May 21 '15 at 02:12
3

A string with all unique characters can have length at most 255. In that case, your algorithm runs in O(1) time.

If the string contains duplicate characters, one of those duplicated characters appears in the first 255 elements of the string. The worst case is then when the first 254 characters of the string are unique and the 255th character is repeated until the end of the string; in that case your algorithm runs in O(N) time.

You can guarantee O(1) time for your algorithm by first checking if the length of the string is greater than 255 and failing immediately if so.

All that assumes char takes one of 256 values. If you instead treat the number of possible character values as a variable C, then the complexities for your algorithm are O(C^2) in the case that the string contains only unique characters and O(NC) in the case that the string contains duplicates, and you can guarantee O(C^2) time by first checking that the length of the string isn't greater than C.

The optimal algorithm is O(min(N, C)) by first checking the string isn't longer than C, and then using any linear-time duplicate detection algorithm.
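A minimal sketch of that optimal algorithm for C = 256 (the function name and the seen-table are illustrative; the table provides the linear-time duplicate detection):

#include <stdbool.h>
#include <stddef.h>

int strunique_optimal(const char *str)
{
    bool seen[256] = { false };
    size_t i;

    for (i = 0; str[i] != '\0'; ++i) {
        /* Explicit cap mirroring the "not longer than C" check;
           with 8-bit chars the seen-test below fires by i = 255. */
        if (i >= 256)
            return 0;
        if (seen[(unsigned char)str[i]])
            return 0;   /* this value occurred before */
        seen[(unsigned char)str[i]] = true;
    }

    return 1;
}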

Paul Hankin
  • 54,811
  • 11
  • 92
  • 118
  • 1
    Nice explanation. I didn't think of the O(1) improvement until after I posted, but I updated before I noticed your answer. – Atsby May 21 '15 at 01:14