1

You are given a string. Develop a function to remove duplicate characters from that string. String could be of any length. Your algorithm must be in space. If you wish you can use constant size extra space which is not dependent any how on string size. Your algorithm must be of complexity of O(n).

My idea was to define an integer array of size of 26 where 0th index would correspond to the letter a and the 25th index for the letter z and initialize all the elements to 0. Thus we will travel the entire string once and and would increment the value at the desired index as and when we encounter a letter.

and then we will travel the string once again and if the value at the desired index is 1 we print out the letter otherwise we do not.

In this way the time complexity is O(n) and the space used is constant irrespective of the length of the string!!

if anyone can come up with ideas of better efficiency,it will be very helpful!!

Poulami
  • 1,047
  • 3
  • 13
  • 27
  • guess this is a homework :P any programming language constraint? or pseudo-code is fine? For example if it were PHP you could append to an array and the size would be variable, only using up to as many characters as you have traveled and checking if a char has been checked is a matter of an in_array (for example) – Purefan Aug 05 '11 at 06:42
  • 1
    I don't think assuming the string contains only `a` to `z` is fair. There are over one million codepoints in Unicode. – Ray Toal Aug 05 '11 at 06:43
  • 1
    @purefan this is not a homework question as you can already see i hav tagged it as an interview qs.. – Poulami Aug 05 '11 at 06:46
  • @Ray Toal Ya i will be grateful if you can come up with a better soln whc will include all the codepoints – Poulami Aug 05 '11 at 06:47
  • 1
    Okay I will give it a shot but first, what is meant by repeated? Do you mean any character that appears more than once anywhere in the string, or any contiguous runs of two or more occurrences of the same character. The latter is what is removed in the Unix `uniq` program. – Ray Toal Aug 05 '11 at 06:52
  • @Ray Toal Any character that appears more than once eg BANANAS output would be BANS – Poulami Aug 05 '11 at 06:54
  • Forgot to check for dup questions. On SO: http://stackoverflow.com/questions/2039223/how-can-you-efficiently-remove-duplicate-characters-from-a-string – Ray Toal Aug 05 '11 at 06:55
  • Does preserving the string order matter, or are you just trying to get all the unique characters? IE 'aabc' does the output have to be 'abc' or should it just be the unique characters in any order? – Nicholas Aug 05 '11 at 16:14
  • yes preserving the string order matters – Poulami Aug 05 '11 at 17:50

2 Answers2

1

Your solution definitely fits the criteria of O(n) time. Instead of an array, which would be very, very large if the allowed alphabet is large (Unicode has over a million characters), you could use a plain hash. Here is your algorithm in (unoptimized!) Ruby:

def undup(s) 
  seen = Hash.new(0)
  s.each_char {|c| seen[c] += 1}
  result = ""
  s.each_char {|c| result << c if seen[c] == 1}
  result
end

puts(undup "")
puts(undup "abc")
puts(undup "Olé")
puts(undup "asdasjhdfasjhdfasbfdasdfaghsfdahgsdfahgsdfhgt")

It makes two passes through the string, and since hash lookup is less than linear, you're good.

You can say the Hashtable (like your array) uses constant space, albeit large, because it is bounded above by the size of the alphabet. Even if the size of the alphabet is larger than that of the string, it still counts as constant space.

There are many variations to this problem, many of which are fun. To do it truly in place, you can sort first; this gives O(n log n). There are variations on merge sort where you ignore dups during the merge. In fact, this "no external hashtable" restriction appears in Algorithm: efficient way to remove duplicate integers from an array (also tagged interview question).

Another common interview question starts with a simple string, then they say, okay now a million character string, okay now a string with 100 billion characters, and so on. Things get very interesting when you start considering Big Data.

Anyway, your idea is pretty good. It can generally be tweaked as follows: Use a set, not a dictionary. Go trough the string. For each character, if it is not in the set, add it. If it is, delete it. Sets take up less space, don't need counters, and can be implemented as bitsets if the alphabet is small, and this algorithm does not need two passes.

Python implementation: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/

Community
  • 1
  • 1
Ray Toal
  • 86,166
  • 18
  • 182
  • 232
0

You can also use a bitset instead of the additional array to keep track of found chars. Depending on which characters (a-z or more) are allowed you size the bitset accordingly. This requires less space than an integer array.

murrekatt
  • 5,961
  • 5
  • 39
  • 63