Using a data structure (a HashMap) I was able to do it.

This is the code:

import java.util.*;

class Unique {
    public static void main(String[] args) {
        HashMap<Character, Integer> charMap = new HashMap<Character, Integer>();
        boolean isNotUnique = false;
        for (char c : args[0].toCharArray()) {
            // Bump the count for this character (null means not seen yet).
            Integer freq = charMap.get(c);
            charMap.put(c, (freq == null) ? 1 : freq + 1);
            if (charMap.get(c) > 1) {
                isNotUnique = true;
            }
        }
        // Prints true if any character occurs more than once.
        System.out.println(isNotUnique);
    }
}

Without data structures, I came up with a brute-force approach. This runs in O(n^2) time:

class Unique
{
    public static void main(String[] args)
    {
        String inputString = args[0];
        System.out.println(isUnique(inputString));
    }

    private static boolean isUnique(String inputString) {
        // Compare each character against every character after it.
        for (int i = 0; i < inputString.length(); i++)
        {
            for (int j = i + 1; j < inputString.length(); j++)
            {
                if (inputString.charAt(i) == inputString.charAt(j))
                {
                    return false;
                }
            }
        }
        return true;
    }
}

I was wondering if it was possible to solve this in O(n) time complexity.

3 Answers


What is the definition of a character here? a-z and A-Z, or all of Unicode?

If the string is quite long, say one million characters, you could build an int array whose length is the size of your character set, initialized to zeros.

Then go through the string and, for each character, do array[(int) c]++. That way you can easily read the number of times each character appears out of the array; any count greater than one means a duplicate.

A short string won't need such a method, however.

This method is O(n).
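
For example, a minimal sketch of this counting approach for a 7-bit ASCII character set (the method name `isUniqueAscii` and the early return are my own additions):

private static boolean isUniqueAscii(String s) {
    int[] counts = new int[128];     // one counter per ASCII character, zero-initialized
    for (char c : s.toCharArray()) {
        counts[c]++;                 // the char value indexes directly into the array
        if (counts[c] > 1) {
            return false;            // this character has appeared before
        }
    }
    return true;
}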

wingnet
  • I'd be interested to know how much slower a regular expression would be. – BudgieInWA Mar 28 '11 at 03:10
  • I'm not familiar with regular expressions. I guess the efficiency of a regular expression depends on the specific pattern, so we can't give one estimate for regular expressions in general. – wingnet Mar 28 '11 at 03:18
  • I believe regexp would most probably yield O(N*N), or O(N*logN). – Vladimir Dyuzhev Mar 28 '11 at 03:29
  • @Vladimir, @wingnet According to [this](http://stackoverflow.com/questions/21669/complexity-of-regex-substitution), regex's run through a string of length n in O(n) time (after being compiled). If that's the case, you could be lazy and use `(.).*?\1` to find a repeated char (but that is a pretty boring way to do it). – BudgieInWA Mar 28 '11 at 14:02
  • @BudgieInWA that's helpful to know. – wingnet Mar 29 '11 at 01:52
  • This is probably true, but to find duplicates (not a predefined pattern) a regexp would have to run an O(N) operation for every single char in the string, which gives O(N*N). Or so I think. It would be an interesting experiment. Any takers? – Vladimir Dyuzhev Mar 29 '11 at 02:30

If you need to support Unicode characters that aren't represented by surrogate char pairs, this will do it:

private static boolean isUnique(String inputString) {
    long[] used = new long[1024]; // 1024 * 64 = 65536 bits, one per char value
    for (char c : inputString.toCharArray()) {
        // c >>> 6 picks the 64-bit word; 1L << c picks the bit within it
        // (a long shift distance is taken mod 64, so this is effectively c & 63).
        if ((used[c >>> 6] & (1L << c)) != 0) {
            return false;
        }
        used[c >>> 6] |= 1L << c;
    }
    return true;
}

The `1L` matters here: shifting an `int` literal wraps the shift distance mod 32, which would make distinct characters share bits. The `!= 0` test (rather than `> 0`) is needed because setting bit 63 makes the word negative.

It's using bit flags to save memory. It's essentially the same thing as if you used an array of booleans:

private static boolean isUnique2(String inputString) {
    boolean[] used = new boolean[65536];
    for (char c : inputString.toCharArray()) {
        if (used[c]) {
            return false;
        }
        used[c] = true;
    }
    return true;
}

If you only need to support ASCII characters you could limit the size of `used` in either case to reduce the memory required (so `long[4]` and `boolean[256]`). Below a certain length of `inputString` it's probably faster to do the n^2 check than to allocate the memory for this, though. So ideally you'd do a combination of the two based on the length.
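
A sketch of the `long[4]` variant, assuming every input character fits in 8 bits (the method name and the range guard are my own additions):

private static boolean isUniqueLatin1(String inputString) {
    long[] used = new long[4]; // 4 * 64 = 256 bits, one per 8-bit character
    for (char c : inputString.toCharArray()) {
        if (c > 255) {
            throw new IllegalArgumentException("character out of 8-bit range: " + c);
        }
        if ((used[c >>> 6] & (1L << c)) != 0) {
            return false;
        }
        used[c >>> 6] |= 1L << c; // shift distance for a long is taken mod 64
    }
    return true;
}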

If you need to support all possible Unicode characters you'll have to modify this to support surrogate char pairs. You can detect them with Character.isHighSurrogate(c). See this page for some help and search Google for more details.
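
As a rough sketch of that modification, iterating by code point so surrogate pairs are combined before the uniqueness check (switching to a `Set`, as suggested in the comments; the method name is my own, and it needs `java.util.Set`/`java.util.HashSet` imported):

private static boolean isUniqueCodePoints(String inputString) {
    Set<Integer> seen = new HashSet<Integer>();
    int i = 0;
    while (i < inputString.length()) {
        // codePointAt combines a high/low surrogate pair into one code point
        int cp = inputString.codePointAt(i);
        if (!seen.add(cp)) {
            return false; // add() returns false if the code point was already seen
        }
        i += Character.charCount(cp); // advances by 2 for a surrogate pair, 1 otherwise
    }
    return true;
}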

WhiteFang34
  • UTF-16 (currently in use by Java) can produce up to 1,112,064 symbols, not 65K, due to surrogates. – Vladimir Dyuzhev Mar 28 '11 at 03:09
  • @Vladimir: good point, I updated my answer to specify that it only supports Unicode characters up to `\uFFFF` that can be returned through the primitive `char`. It still yields the same result as the example provided by the OP, which checks per `char` as well. – WhiteFang34 Mar 28 '11 at 03:20
  • I checked your code... it runs fine... I am not able to understand how if (used[c]) { return false; } used[c] = true; works –  Mar 28 '11 at 03:21
  • Sorry, I still have to disagree. Surrogates consist of two chars. When you take both and map them into the array, you end up with two counters, both of which are wrong. You'd need to use Character.isHighSurrogate(c) to avoid it. – Vladimir Dyuzhev Mar 28 '11 at 03:26
  • @redmave, used[c] is essentially equivalent to (charMap.get(c) > 0) in your hashtable solution. – Zach Langley Mar 28 '11 at 03:38
  • Updated again to clarify on limitation and details about surrogate char pairs. Probably want to go with a `Set` implementation to support that since there are so many possible characters. – WhiteFang34 Mar 28 '11 at 03:40
  • Can you explain what right unsigned shifting a character with 6 means (c >>> 6)? – yesh Jan 31 '18 at 23:03

I was wondering if it was possible to solve in O(n) time complexity:

There are two simple solutions that are O(N) in time:

  • The HashSet approach is O(N) in time and O(N) in space, where N is the String length. (A regular Java HashSet containing N distinct characters will take O(N) space, with a relatively large constant of proportionality; see the sketch after this list.)

  • The bit array approach is O(N) in time and O(1) in space, but the O(1) is 8K bytes (or 64K bytes if you use a boolean[]) and there is an associated cost of zeroing that much memory added to the time.
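
A minimal sketch of the HashSet approach mentioned above (the method name is my own; `Set.add` returns false when the element is already present):

private static boolean isUniqueHashSet(String inputString) {
    Set<Character> seen = new HashSet<Character>();
    for (char c : inputString.toCharArray()) {
        if (!seen.add(c)) { // add() is false when c is already in the set
            return false;
        }
    }
    return true;
}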

Neither of these is the best answer in all cases.

  • For small enough N, a simple O(N^2) nested loop will be fastest. (And it uses no extra memory.)

  • For medium-sized N, a custom hash table that uses rehashing on collision (sketched below) will be better than HashSet or the bit array approach. The HashSet approach will be better than the bit array approach.

  • For large enough N, the bit array approach will be fastest and use the least memory.

(For the above, I'm assuming that input strings don't contain any duplicate characters. If they do, then the actual value of N will be less than the string length.)
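
For the medium-sized case, one possible shape for such a custom hash table is open addressing with linear probing over a plain char[]; all of the names and sizing choices below are my own illustrative assumptions, not a definitive implementation:

private static boolean isUniqueProbing(String inputString) {
    // Power-of-two capacity, kept above 2 * length so the load factor stays
    // under 0.5 and probing always terminates at an empty slot.
    int capacity = Integer.highestOneBit(Math.max(inputString.length(), 4) * 4);
    char[] table = new char[capacity]; // 0 marks an empty slot
    boolean seenNul = false;           // '\u0000' can't live in the table, track it separately
    for (char c : inputString.toCharArray()) {
        if (c == '\u0000') {
            if (seenNul) return false;
            seenNul = true;
            continue;
        }
        int i = c & (capacity - 1);
        while (table[i] != 0) {        // probe ("rehash") until we find c or a free slot
            if (table[i] == c) return false;
            i = (i + 1) & (capacity - 1);
        }
        table[i] = c;
    }
    return true;
}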


If duplicate character detection needs to cope with UTF-16 surrogate pairs, then the simple approach is to transcode on the fly to Unicode codepoints, and change the data structures to use a HashSet<Integer>, bigger bit arrays, etc.
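
For instance, assuming Java 8+ streams are available (with `java.util.Set`/`java.util.HashSet` imported; the method name is mine), the HashSet<Integer> version can be written compactly over the string's code points:

private static boolean isUniqueCodePoints(String s) {
    Set<Integer> seen = new HashSet<Integer>();
    // codePoints() combines surrogate pairs; allMatch stops at the first duplicate
    return s.codePoints().allMatch(seen::add);
}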

Conversely, if you can limit the size of the input character set, you can reduce the size of the bit arrays, etc.

These tweaks would make a big difference to where the small / medium / large thresholds are likely to fall.

Stephen C