17

So I'm presented with a problem that states. "Determine if a string contains all unique characters"

So I wrote up this solution that adds each character to a set, but if the character already exists it returns false.

private static boolean allUniqueCharacters(String s) {

    Set<Character> charSet = new HashSet<Character>();
    for (int i = 0; i < s.length(); i++) {
        char currentChar = s.charAt(i);
        if (!charSet.contains(currentChar)) {
            charSet.add(currentChar);

        } else {
            return false;
        }

    }
    return true;

}

According to the book I am reading this is the "optimal solution"

public static boolean isUniqueChars2(String str) {
    if (str.length() > 128)
        return false;

    boolean[] char_set = new boolean[128];

    for (int i = 0; i < str.length(); i++) {
        int val = str.charAt(i);

        if (char_set[val]) {
            return false;
        }
        char_set[val] = true;
    }

    return true;
}

My question is, is my implementation slower than the one presented? I assume it is, but if a Hash look up is O(1) wouldn't they be the same complexity?

Thank you.

OldCurmudgeon
  • 64,482
  • 16
  • 119
  • 213
fsdff
  • 611
  • 4
  • 10
  • 15
    Well, the book "solution" fails in Java because characters are 16 bits and can have a value from `0` to `65535`, so a 128-byte array won't work. – Jim Garrison Aug 21 '18 at 07:43
  • 22
    They are of same _complexity_. Not necessarily of same _speed_. Carrying 10 vases up a staircase has the same _complexity_ as carrying 10 pianos up a staircase, but the effort involved is somewhat different. Hash O(1) is slower than array O(1), given that hash needs to, at least, calculate the hash. Also, hash is _amortised_ O(1), not true O(1); it does not matter complexity-wise, but it does affect speed. – Amadan Aug 21 '18 at 07:43
  • @Amadan `HashSet` is backed by a hash table and insertion to a hash table can be O(n) in the worst case, right? So the first one has time complexity O(n^2) isn't it? – Sweeper Aug 21 '18 at 07:45
  • @Sweeper: It's not a random worst case, like in qsort; the worst case comes rather predictably, which is why we say it has [amortised complexity](https://en.wikipedia.org/wiki/Amortized_analysis) of O(1). – Amadan Aug 21 '18 at 07:47
  • @YvesDaoust: You are, of course, correct, and I suspect we were talking about different things. Typically when talking about hashes you'd discount collisions, on the presupposition that the hash function is uniform and the hash table is spacious enough. While it is certainly possible to craft such examples where everything collides, sorting already-sorted arrays tends to happen quite a bit more often in comparison. But I was actually thinking more of resizing and consequent rehashing as the population grows, which is predictable, constant, and thus comfortably amortised to O(1). – Amadan Aug 21 '18 at 08:37
  • @YvesDaoust: As I said, we're talking about different things. Under assumption of perfect hashing, insert is technically O(N), because it might trigger resize. It is said to be amortised O(1), because we can ignore it, as resize happens predictably as N grows. I was *not* talking, or thinking about hash collisions. I probably should have, but I was not. – Amadan Aug 21 '18 at 08:47
  • 5
    Apples and oranges. Your implementation is more general since it doesn't assume this is pure ascii. One performance improvement you can do is getting rid of the contains operation and simply comparing the size of the set and the string. As soon as they start to differ, return false. – Jilles van Gurp Aug 21 '18 at 08:54
  • The book says that a solution is optimal doesn't mean that other solutions are not optimal. – user202729 Aug 21 '18 at 09:27
  • 7
    I suggest throwing the book away if the author thinks there are only 128 characters – phuclv Aug 21 '18 at 09:33
  • 3
    To be fair your algorithm calculates the hash twice for the contains() and the add(). The add() method returns whether the element was already contained so no need for the extra contains() check. – Nathan Adams Aug 21 '18 at 09:43
  • 1
    HashSet.add returns a boolean indicating whether the character was already in the set, so you don't need to call contains before adding it, and you can write with a stream `boolean allUniqueCharacters(String s) { HashSet set = new HashSet(); return s.chars().allMatch(x -> set.add(x))); }` (crossposted while I was looking up stream syntax) – Pete Kirkham Aug 21 '18 at 09:54
  • @phuclv and both functions are meaningless for a large swathe of the BMP – Caleth Aug 21 '18 at 13:38
  • ack!, I almost accidentally downvoted your question because I’m annoyed at the book. – Konrad Rudolph Aug 21 '18 at 13:54
  • 1
    Out of curiosity, how would a regex like `(.).*\1` compare against the OP's existing methods? – ctwheels Aug 21 '18 at 14:14
  • 1
    If you assume there are at most 128 different chars in the string, the complexity of both algorithms is `O(1)`, since if there are no duplicates up to the 128th char of a string, the 129th must be a duplicate... – fabian Aug 21 '18 at 15:13
  • @ctwheels I want to know as well. – Sweeper Aug 21 '18 at 16:30

6 Answers6

12

As Amadan said in the comments, the two solutions have the same time complexity O(n) because you have a for loop looping through the string, and you do constant time operations in the for loop. This means that the time it takes to run your methods increases linearly with the length of the string.

Note that time complexity is all about how the time it takes changes when you change the size of the input. It's not about how fast it is with data of the same size.

For the same string, the "optimal" solution should be faster because sets have some overheads over arrays. Handling arrays is faster than handling sets. However, to actually make the "optimal" solution work, you would need an array of length 2^16. That is how many different char values there are. You would also need to remove the check for a string longer than 128.

This is one of the many examples of the tradeoff between space and time. If you want it to go faster, you need more space. If you want to save space, you have to go slower.

Sweeper
  • 213,210
  • 22
  • 193
  • 313
3

Both algorithms have time complexity of O(N). The difference is in their space complexity.

The book's solution will always require storage for 128 characters - O(1), while your solution's space requirement will vary linearly according to the input - O(N).

The book's space requirement is based on an assumed character set with 128 characters. But this may be rather problematic (and not scalable) given the likelihood of needing different character sets.

ernest_k
  • 44,416
  • 5
  • 53
  • 99
  • in an interview would my solution be unacceptable or trivial? – fsdff Aug 21 '18 at 07:54
  • @fsdff that will depend entirely on what the interviewer expects, so it's very subjective. But your solution seems very normal/common to me. – ernest_k Aug 21 '18 at 07:57
  • I don't see how the OP's solution would be O(N). When all possible keys have been entered in the hashmap, it will not grow anymore. The required space is O(1) as for the array. –  Aug 21 '18 at 08:23
  • @YvesDaoust You're assuming some character set, that makes sense... except that the point of complexity analysis is getting an idea of worst-case scenarios. And whether the character set is finite or not, it's a fact in this case that space complexity depends on input, and no explicit mention was made about the size of the character set. – ernest_k Aug 21 '18 at 08:30
  • @ernest_k: whatever the alphabet size, let M, there is no difference between the space complexities: both O(M) (or O(M Log M) if you want to be picky), contrary to what you say. And no dependency on the size of the input. –  Aug 21 '18 at 08:33
  • @YvesDaoust Rather that is more accurate. But `M` and `N` are defined by the same input of the function, and I suppose that's the point I'm trying to make. – ernest_k Aug 21 '18 at 08:41
  • @ernest_k: what do you mean ? `M` is the alphabet size and `N` the input size. With all rigor, we have O(M Log M) space complexity for this algorithm. –  Aug 21 '18 at 08:43
2

The hashmap is in theory acceptable, but is a waste.

A hashmap is built over an array (so it is certainly more costly than an array), and collision resolution requires extra space (at least the double of the number of elements). In addition, any access requires the computation of the hash and possibly the resolution of collisions.

This adds a lot of overhead in terms of space and time, compared to a straight array.

Also note that it is kind of folklore that a hash table has an O(1) behavior. The worst case is much poorer, accesses can take up to O(N) time for a table of size N.


As a final remark, the time complexity of this algorithm is O(1) because you conclude false at worse when N>128.

  • While the worst case for *individual* accesses to a hash table is worse than O(1), this is entirely irrelevant here because the algorithm performs Θ(n) accesses, not a single one. Unless a piss-poor hash algorithm or collision avoidance scheme is used, the expected runtime of the algorithm is very much still O(n). And I’m curious how you intend to improve the asymptotic space complexity, which you’ve omitted to explain in your answer. – Konrad Rudolph Aug 21 '18 at 13:58
  • @KonradRudolph: for instance a hash function returning a constant. In this question, there is nothing that can be improved asymptotically. I am just saying that hashing brings no practical benefit, on the opposite. –  Aug 21 '18 at 14:03
  • But this **isn’t** the hash function being used here, and this isn’t how the complexity of the hash table operations are usually (or usefully) calculated. But OK, you say hashing brings no practical benefits; in that case please enlighten me how you’d implement this algorithm for a Unicode alphabet. – Konrad Rudolph Aug 21 '18 at 14:08
  • @KonradRudolph: the question is about ASCII. –  Aug 21 '18 at 14:17
  • It was my impression that the comments under the question established that the question was *not* about ASCII, and that, rather, the answer given in the book is simply wrong. At any rate, ASCII contains 127 valid characters, not 128 so at best the answer in the book is still incompetent. – Konrad Rudolph Aug 21 '18 at 14:21
1

Your algorithm is also O(1). You can think about complexity like how my algorithm will react to the change in amount of elements processed. Therefore O(n) and O(2n) are effectively equal.

People are talking about O notation as growth rate here

amerykanin
  • 255
  • 2
  • 5
  • 15
1

Your solution is could indeed be slower than the book's solution. Firstly, a hash lookup ideally has a constant time lookup. But, the retrieval of the object will not be if there are multiple hash collisions. Secondly, even if it is constant time lookup, there is usually significant overhead involved in executing the hash code function as compared to looking up an element in an array by index. That's why you may want to go with the array lookup. However, if you start to deal with non-ASCII Unicode characters, then you might not want to go with the array approach due to the significant amount of space overhead.

entpnerd
  • 10,049
  • 8
  • 47
  • 68
0

The bottleneck of your implementation is, that a set has a lookup (and insert) complexity* of O(log k), while the array has a lookup complexity in O(1).

This sounds like your algorithm must be much worse. But in fact it is not, as k is bounded by 128 (else the reference implementation would be wrong and produce a out-of-bounds error) and can be treated as a constant. This makes the set lookup O(1) as well with a bit bigger constants than the array lookup.

* assuming a sane implementation as tree or hashmap. The hashmap time complexity is in general not constant, as filling it up needs log(n) resize operations to avoid the increase of collisions which would lead to linear lookup time, see e.g. here and here for answers on stackoverflow.

This article even explains that java 8 by itself converts a hashmap to a binary tree (O(n log n) for the converstion, O(log n) for the lookup) before its lookup time degenerates to O(n) because of too many collisions.

allo
  • 3,955
  • 8
  • 40
  • 71
  • [I moved this discussion into chat](https://chat.stackoverflow.com/rooms/178457/discussion-between-allo-and-konrad-rudolph). – allo Aug 21 '18 at 14:50