0

Given a string, find the longest substring whose characters are contiguous (i.e. they are consecutive letters) but possibly jumbled (i.e. out of order). For example:
Input : "owadcbjkl"
Output: "adcb"
We consider adcb as contiguous as it forms abcd.

(This is an interview question.)

I have an idea of running a while loop with 2 conditions, one that checks for continuous characters using Python's ord and another condition to find the minimum and maximum and check if all the following characters fall in this range.

Is there any way this problem could be solved with low running time complexity? The best I can achieve is O(N^2) where N is the length of the input string and ord() seems to be a slow operation.

Gilles 'SO- stop being evil'
  • 104,111
  • 38
  • 209
  • 254
Sandy
  • 313
  • 1
  • 4
  • 13
  • 3
    The question is too verbose: try to write your algorithm in a Python-like format (pseudo-code) so that we can understand it better. – Don Mar 25 '13 at 13:23
  • This kind of question is off-topic here. You should ask this over at http://codereview.stackexchange.com/ – Ocaso Protal Mar 25 '13 at 13:25
  • Please provide a detailed explanation of what defines a valid substring in your question. You've said that `adcb` is valid, would `adc` be too? – MattH Mar 25 '13 at 13:28
  • no adc is not a valid string since it is not continuous .. another valid string can be "bdca" hope u get the point now – Sandy Mar 25 '13 at 13:43
  • Is it so difficult to mention that by "continuous" you mean the letters appear as a continuous sequence in _the alphabet_? Continuous has no meaning without context. "adcb" or "bdca" is continuous _in your head_ because "abcd" is a continues alphabetic sequence. – Gaminic Mar 25 '13 at 13:47
  • If N is the length of the input, you can't do better than that because you need to at least read the input – xuanji Mar 25 '13 at 13:48
  • continuous to programmers means, continuous within the array...unless otherwise specified. – Stan R. Mar 25 '13 at 13:48
  • it is an interview question and i have not made any changes to it, this is how it was asked .. the point of the question is to look for continuous characters in a set even if they are jumbled.. – Sandy Mar 25 '13 at 13:53
  • 1
    The simple `O(n*m*m)` solution can based on the fact that the longest substring is no longer than `m` where `m` is the alphabet size. If `m` is fixed e.g., `m == 26` then it is a linear time (though ineffective) solution. If `m` is not fixed then [this question might be related](http://stackoverflow.com/questions/177118/algorithm-to-determine-if-array-contains-n-nm). – jfs Mar 25 '13 at 14:18
  • @J.F.Sebastian thats a similar question thanks for the link! – Sandy Mar 25 '13 at 14:32
  • 2
    @OcasoProtal This question is fully on-topic on [so] since it is about implementing [a software algorithm](http://stackoverflow.com/faq#questions). It would be off-topic on [codereview.se] which is only about reviewing working code; I'm puzzled why you would even suggest it. A question about the algorithm itself would be on-topic on [cs.se], but not implementing it in Python: this firmly belongs on [so]. – Gilles 'SO- stop being evil' Mar 25 '13 at 14:51
  • @Gilles take a look at the revisions of this question, especially the second one and you will understand. The current revision is completely different and fits SO. – Ocaso Protal Mar 25 '13 at 15:01
  • related: [Longest subarray whose elements form a continuous sequence](http://stackoverflow.com/questions/15966113/longest-subarray-whose-elements-form-a-continuous-sequence) – jfs Apr 13 '13 at 23:52

1 Answers1

4

If the substring is defined as ''.join(sorted(substr)) in alphabet then:

  • there is no duplicates in the substring and therefore the size of the longest substring is less than (or equal to) the size of the alphabet

  • (ord(max(substr)) - ord(min(substr)) + 1) == len(substr), where ord() returns position in the alphabet (+/- constant) (builtin ord() can be used for lowercase ascii letters)

Here's O(n*m*m)-time, O(m)-space solution, where n is len(input_string) and m is len(alphabet):

from itertools import count

def longest_substr(input_string):
    maxsubstr = input_string[0:0] # empty slice (to accept subclasses of str)
    for start in range(len(input_string)): # O(n)
        for end in count(start + len(maxsubstr) + 1): # O(m)
            substr = input_string[start:end] # O(m)
            if len(set(substr)) != (end - start): # found duplicates or EOS
                break
            if (ord(max(substr)) - ord(min(substr)) + 1) == len(substr):
                maxsubstr = substr
    return maxsubstr

Example:

print(longest_substr("owadcbjkl"))
# -> adcb
jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • thanks i just have one question why is string slicing O(m) i was assuming that string slicing is fast like array indexing in python. known the length of the char python should be easily able to find the last index right? or Does it traverse all the way look for the end of slice ? – Sandy Mar 26 '13 at 14:16
  • also what if we use additional data structure then can we reduce the time complexity? – Sandy Mar 26 '13 at 14:18
  • 1
    @Sandy: slicing a string returns a *copy* in CPython. To copy `m` characters you need O(m)-time, and O(m)-space. `buffer()` or `memoryview()` could used to avoid copying but `set`, `max`, `min` are O(m) time so it won't improve the time complexity. – jfs Mar 26 '13 at 14:44
  • 1
    @Sandy: You could probably get `O(n*m)` if you keep `min[startpos]` that store a minimum starting at `startpos` till current position and use it to reset the start position for current substr when a duplicate is detected (you could use a `seen` list of size `len(alphabet)` to keep position where duplicate is located). I haven't thought it through. For a fixed `m` (`len(alphabet)`) it just improves the constant factor and doesn't change the time complexity. I'm not sure that sublinear time performance with respect to `n` (`len(input_string)`) is possible. – jfs Mar 26 '13 at 14:44
  • what if we use array of character instead of the actual string itself, maintain the min and max in the inner loop like u said and then every time take the max(prev_max,new_charc) = O(1) and min(prev_min,new_charac) = O(1).. then we can achieve O(N*M) right? do u think we can take this any lower? – Sandy Mar 26 '13 at 15:01
  • unless you need to modify the input string (e.g., to perform the inplace radix sort); you don't need the array of characters. Updating a single `min[startpos]` is `O(1)` but there are `O(m)` of them in the worst case (different `startpos`) that leads to `O(n*m)` algorithm. The loops from the answer are not used in this algorithm. – jfs Mar 26 '13 at 15:50