16

The goal of the code is to find the longest alphabetical substring within a string.

s = 'xyzbcdezzz'
longest_string = ''
current_string = ''
stringcount = 0

for n in range (len(s) - 1):
    if s[n] <= s[n+1]:
        current_string += (s[n]+s[n+1])
        stringcount += 1
        print('current string:', stringcount, current_string)


    elif s[n] > s[n+1]:
        if len(current_string) > len(longest_string) :
            longest_string = current_string
            current_string = ''
            stringcount = 0
            print('the longest string checked is:', longest_string, ', count reset')

if len(current_string) == len(longest_string):
    print (current_string, longest_string)
if len(current_string) > len(longest_string):
    print (current_string)
if len(longest_string) > len(current_string):
    print(longest_string)

When I run this code, it gives 'abbccd' as the longest substring, when it's actually 'abcd'. This is because it checks the character a, comparing it to the next in the sequence, and then adds a to b giving "ab". Then it checks b, comparing to c and adds bc together, and then adds "bc" to "ab".

To fix this, I've been attempting to make the loop skip the next character if it's in alphabetical order already, and check the next one by increasing the value of 'n' once the condition is met, but this doesn't seem to do anything at all.

Advice, tips, corrections and harsh criticism are all welcomed.

EDIT: It appears I've misled some of you, so I apologise. What I meant was that if I have a string, it extracts the longest possible substring in alphabetical order. In the case of xyzbcdezzz, it will extract 'bcdezzz' because that's the longest possible alphabetical order substring, not bcde. The problem with my current code, is that it gives bccddeezzzzz. If I could skip one loop when the first if condition is true, then I think it might work in my code.

user2314737
  • 27,088
  • 20
  • 102
  • 114
Daedalus
  • 163
  • 1
  • 7
  • 2
    Not the answer but you may want to use this fact. If you take the ascii value of a character in string with ord(char) it will return a number. The value of ord('a')=97, ord('b')=98, ect in incremental steps. – syntaxError Jun 10 '17 at 07:28
  • 1
    would `xyzbcdezzz` give `bcde`, or are you expecting to start from `a`? – JL Peyret Jun 10 '17 at 07:34
  • Hiro has just posted the logic to my statement if you look below. – syntaxError Jun 10 '17 at 07:35
  • 1
    @JLPeyret the goal is for the code to give 'bcdezzz', but it gives 'bccddeezzzzz'. – Daedalus Jun 10 '17 at 07:49
  • @rosh That's a good advice, but one should take note this scheme horribly falls apart if you try to sort strings in languages that use non-ASCII characters. But then again, probably nobody does manual char-by-char sorting outside of simple exercises these days. – Dragomok Jun 10 '17 at 15:52

9 Answers9

11

TL;DR: the last code after the edit solves the problem

This is a variant of the Longest Common Substring Problem.

def longestSubstring(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

alphabets = "abcdefghijklmnopqrstuvwxyz"
s = 'jamabcdaskl'

print('longest substring:', longestSubstring(s,alphabets))

Credits to this post for the subroutine.

Edit:

It seems like the above code doesn't work for all cases, so I had to redesign the function.

def longestAlphaSubstring(str2):
    str1 = "abcdefghijklmnopqrstuvwxyz"
    longest = ""
    for i in range(len(str1)+1):
        if str1[:i] in str2 and len(str1[:i])>len(longest):
            longest = str1[:i]
    return longest

print(longestAlphaSubstring('jamabcdaskl'))
print(longestAlphaSubstring('asdklfjalkdfjabcdefghijklmnopqrstuvwxyzaaabcasdkfjl;kasdf'))

Output:

abcd
abcdefghijklmnopqrstuvwxyz

This works on the assumption that the substring should always begin with a. This iterates through every possible substring from 'a', 'ab', 'abc', ... , up to the full string of alphabets and then stores the longest substring encountered in the check.

For the sake of completeness, here is the code that would work for any longest common substring:

def longestSubstring(str1, str2):
    longest = ""
    for j in range(len(str1)):
        for i in range(len(str1)+1):
            if str1[j:i] in str2 and len(str1[j:i])>len(longest):
                longest = str1[j:i]
    return longest

where one string contains the alphabets in order and the other contains the test string. Be aware that this is O(n^2) in complexity (not that it matters for small cases).

Ébe Isaac
  • 11,563
  • 17
  • 64
  • 97
  • It appears to not, although I'm unsure why. I assume string 1 is supposed to be the test string, while string 2 is supposed to be the alphabet? When I input 'adfhanmzkl' It should give 'adfgh', I think. Which is the longest alphabetical substring. Edit: it gives kl, which are two letters sequenced right next to each other in the alphabet. is that what it checks for? – Daedalus Jun 10 '17 at 15:52
  • @Daedalus That is *no* substring; it deviates completely from the originally posted concept. – Ébe Isaac Jun 10 '17 at 15:54
  • I meant 'adfh', apologies. I added another example to my original post to make it clearer I think. If what I'm asking for isn't a sub-string, then I'm confused with terminology. – Daedalus Jun 10 '17 at 16:02
8

a different version looping over zip(strg, strg[1:]) in order to compare the previous and the current character in the same loop iteration:

def longest_substring(strg):

    substring = strg[0]
    max_substring = ''

    for cur, nxt in zip(strg, strg[1:]):

        if ord(nxt) >= ord(cur):
            substring += nxt
            if len(substring) > len(max_substring):
                max_substring = substring
        else:
            substring = nxt

    return max_substring

comparing the characters with ord this way has the disadvantage that these characters !"#$%&'()*+,-./0123456789:;=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_``abcdefghijklmnopqrstuvwxyz{|}~ will be considered as 'in alphabetical order'. you may need to tweak that to your needs...

hiro protagonist
  • 44,693
  • 14
  • 86
  • 111
  • you could "fix" the fact that `ord` works on more than just `[a-z]` by defining your own. `def ord2(c): = if c in string.ascii_lowercase: return ord(c) else raise ValueError("ord2 only handles [a-z], got " + c)` and using that instead. Or even better just go direct: `def nextch(c): return ord(c)+1 if c in string.ascii_lowercase else None`, then `if prev == nextch(cur): # success` – Adam Smith Jun 10 '17 at 08:23
  • 1
    @AdamSmith: thanks! you're right, i could to that... was just not sure if that is necessary. (and thanks for finally pointing out that `ascii_lowercase` is defined in python already!) – hiro protagonist Jun 10 '17 at 08:27
  • Ha, yes. Just `import string` first (I wrote that as `strings` originally -- had my head in a different language! It's fixed now) – Adam Smith Jun 10 '17 at 08:28
  • this prints '' with `longest_substring('jamchjdaskl')` it should be 'kl' – salparadise Jun 10 '17 at 09:56
  • @salparadise: thanks for pointing that out! yup, that was a bug. fixed. – hiro protagonist Jun 10 '17 at 10:20
  • 1
    Not sure why this isn't the top voted answer, fastest and correct. – salparadise Jun 11 '17 at 02:41
8

From an algorithmic perspective, the most optimized approach is using a Suffix Tree in order to find the longest sub string in a given string. (I've implemented an optimized version of the suffix tree in python a while ago. In case you're interested you can check https://github.com/kasramvd/SuffixTree

As another hacky way you can utilize the Numpy in order to find the largest sub-string using the diff(), where() and split() functions:

In [52]: s = 'pqsrjamabcdaskl'

In [54]: ''.join(max(np.split(list(s), np.where((np.diff([ord(i) for i in s]) == 1) == False)[0] + 1), key=len))
Out[54]: 'abcd'

Explanation:

The logic behind this code is to find the indices of the characters that the difference of their ascii value is 1.

In numpy we can simply do that by np.diff function. But since it needs an array of items we can use a list comprehension to pass the list of intended values to the function. Then by comparing the result with 1 we can get a list of bool array as follows:

In [55]: np.diff([ord(i) for i in s]) == 1           
Out[55]: 
array([ True, False, False, False, False, False, False,  True,  True,
        True, False, False, False,  True], dtype=bool)

Now we can get the indices of the False items using np.where to pass them to split function:

In [57]: np.split(list(s), np.where((np.diff([ord(i) for i in s]) == 1) == False)[0] + 1)
Out[57]: 
[array(['p', 'q'], 
       dtype='<U1'), array(['s'], 
       dtype='<U1'), array(['r'], 
       dtype='<U1'), array(['j'], 
       dtype='<U1'), array(['a'], 
       dtype='<U1'), array(['m'], 
       dtype='<U1'), array(['a', 'b', 'c', 'd'], 
       dtype='<U1'), array(['a'], 
       dtype='<U1'), array(['s'], 
       dtype='<U1'), array(['k', 'l'], 
       dtype='<U1')]

The +1 is actually because np.split splits from 0 to our first index and then from first index to the next and etc.

And at the end we can get the longest array using max function by passing the len as the key function.

Also note that this approach will give you the longest consecutive sequence, if you just care about the order you can replace the == 1 with > 0. Here is an example:

In [13]: s = 'pqsrjamabcdasklptvxz'

In [14]: ''.join(max(np.split(list(s), np.where((np.diff([ord(i) for i in s]) > 0) == False)[0] + 1), key=len))
Out[14]: 'klptvxz'
Jason Aller
  • 3,541
  • 28
  • 38
  • 38
Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • `!= 0` -> `> 0`? – Mad Physicist Jun 10 '17 at 08:02
  • @MadPhysicist `>0` is not correct. But there as a minor bug when there are two consecutive character at the beginning. Also there is not need to use two `diff`, the updated version works in both cases. – Mazdak Jun 10 '17 at 08:29
3

Here's a simple, efficient and (IMO) quite readable solution, which dodges the "what does alphabetical order actually mean" issue by taking a custom test function as a parameter:

def longest_matching_substring(s, match):
    current_run_length = 1
    longest_run_length = 1
    longest_run_end = 0

    for i in range(1, len(s)):
        if match(s[i-1], s[i]):
            current_run_length += 1
        else:
            current_run_length = 1

        if current_run_length > longest_run_length:
            longest_run_length = current_run_length
            longest_run_end = i

    longest_run_start = longest_run_end - longest_run_length + 1
    return s[longest_run_start:longest_run_end+1]

You can use it e.g. like this:

print(longest_matching_substring('jamabcdaskl', lambda a,b: a < b))

which will print "abcd". If you'd like to use a case-insensitive comparison, and/or to ignore non-alphabetic characters entirely, you can do that by changing the test function, e.g. like this:

def is_alphabetized(a, b):
    return a.lower() < b.lower()

# this prints "eSt"
print(longest_matching_substring('TeStInG', is_alphabetized))

Of course, by defining a suitable test function, you can also use the same longest_matching_substring function to find the longest consecutive substring where the adjacent pairs of characters satisfy any arbitrary criterion (like, say, "does not contain a consonant followed by a vowel"). And you can even use the same function for finding longest matching consecutive subsequences in other sequence types like lists and tuples, not just in strings.

(This implementation does not, however, work for arbitrary iterable types; to handle those, we'd have to memorize the current and the longest matching substring as we iterate over the input. While doable, that would somewhat complicate the code, and also make it less efficient for ordinary strings and other sequence types.)

Ilmari Karonen
  • 49,047
  • 9
  • 93
  • 153
3

After your edit, it is clearer what was your question. I have modified your code as little as possible to show you where the bug in your solution came from.

Here is the code:

s = 'xyzbcdezzz'
longest_string = ''
current_string = ''

for n in range(len(s)):
    if len(current_string) == 0 or current_string[-1] <= s[n]:
        current_string += s[n]
        print('current string:', len(current_string), current_string)
    else:
        if len(current_string) > len(longest_string):
            longest_string = current_string
        current_string = s[n]
        print('the longest string checked is:', longest_string, ', count reset')

if len(current_string) > len(longest_string):
    longest_string = current_string

print(longest_string)

the problematic part was taking 2 chars at once in

if s[n] <= s[n+1]:
    current_string += (s[n]+s[n+1])

by replacing it with

if len(current_string) == 0 or current_string[-1] <= s[n]:
        current_string += s[n]

you will be adding to the current string exactly if the addition is valid (the last char current_string[-1] and the added wannabe s[n] are in order)

The elif part is simplified to not check s[n] and s[n+1] because it does not reflect what we are trying to do : we do not care if the chars are not in order in the whole string s we care about our current string (this logic is caught by the if statement above and else will be visited only if there is a problem)

so the change here is

elif s[n] > s[n+1]:
    if len(current_string) > len(longest_string) :

to

else:
    if len(current_string) > len(longest_string):
        longest_string = current_string
    current_string = s[n]

adding a new winner if necessary and resetting the current string to the char that was not in order

The last set of ifs are checking the case when current_string ends on the last char of the s, while this is correct it would be less distracting if you add a check after the loop and print only the longest_string

if len(current_string) > len(longest_string):
        longest_string = current_string
print(longest_string)

this way the output is the first valid longest string in every case and not 2 different longest ones when one of them is on the tail of the string

Derte Trdelnik
  • 2,656
  • 1
  • 22
  • 26
1

Using iteration:

import string
alphabet = string.ascii_lowercase
def check_alpha(sut):
    iterate_alpha = iter(alphabet)
    max_longest = ''
    current_longest = ''
    compare_against = next(iterate_alpha)
    for l in sut:
        if l == compare_against:
            current_longest += l
            compare_against = next(iterate_alpha, '')
        else:
            max_longest = max(current_longest, max_longest, key=len)
            iterate_alpha = iter(alphabet)
            current_longest = next((x for x in iterate_alpha if x == l), '')
            compare_against = next(iterate_alpha, '')
    return max(current_longest, max_longest, key=len)

In [39]: assert 'abcdefghi' == check_alpha('abcdezdflkjabcdefghiasldfjlkasdfjkaaabb')
In [40]: assert 'abcd' == check_alpha('jamabcdaskl')
In [83]: assert 'abcde' == check_alpha('aldksfjklasabcdedjfkladabcd')
In [89]: assert 'abcdefghijklmnopqrst' == check_alpha('adfadsabcdefghijklmnopqrst')
In [96]: assert 'ghijklmnopqrst' ==  check_alpha('adfadscghijklmnopqrst')
salparadise
  • 5,699
  • 1
  • 26
  • 32
  • Shouldn't the last assert compare `'cghijklmnopqrst'`? – DocZerø Jun 11 '17 at 12:04
  • @kristof I wrote it thinking strict alphabetical order, but there was a bit of 'scope creep' on the part of OP's question, so don't know what is correct to be fair. – salparadise Jun 11 '17 at 17:36
1

How about using a base string of the alphebetic characters and check if the substring is into this base string then return the max substring found in the base string based in its lenght ?

This is an example:

def get_max_substring(a):
    base = 'abcdefghijklmnopqrstuvwxyz'
    possibilites = (a[k:k+i] for k in range(len(a)) for i in range(k, len(a)))
    sub = (k for k in possibilites if k in base)
    return max(sub, key= lambda x: len(x))

# Tests
a = 'jamabcdaskl'
final = get_max_substring(a)
print(final)

b = 'asdklfjalkdfjabcdefghijklmnopqrstuvwxyzaaabcasdkfjl;kasdf'
test = get_max_substring(b)
print(test)

Output:

abcd
abcdefghijklmnopqrstuvwxyz
Chiheb Nexus
  • 9,104
  • 4
  • 30
  • 43
  • 1
    I should clarify, that it isn't looking to take the alphabet out of a string, but any substring of characters that are in alphabetical order and display them. Maybe I'm testing it wrong, but this code doesn't seem to work when the test doesn't have "abc" in it. If it was "jamchjdaskl", it should print "chj" – Daedalus Jun 10 '17 at 08:30
  • @Daedalus, your comment intrigued me. what's the return for `xyzbcdezzz` ? My answer and the others with a positive score are returning `bcde`. – Chiheb Nexus Jun 10 '17 at 08:51
  • 1
    It should give 'bcdezzz'. My apologies if I didn't make that clear. You're right, the other codes do seem to do that. – Daedalus Jun 10 '17 at 15:54
1

Here's my implementation:

import itertools

def pairwise(iterable):
    # Taken from https://docs.python.org/3.6/library/itertools.html#itertools-recipes
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)


def longest_sorted_substr(s):
    # Split up in pairs
    pairs = pairwise(s)

    all_substrings = []
    curr_substring = []

    for i, pair in enumerate(pairs):
        if ord(pair[0]) <= ord(pair[1]):
            curr_substring.append(s[i])
        else:
            # Start a new substring
            curr_substring.append(s[i])
            all_substrings.append(''.join(curr_substring))
            curr_substring = []

    # Don't forget to add the last character and append the substring
    curr_substring.append(s[-1])
    all_substrings.append(''.join(curr_substring))

    # Sort the substrings according to length and return the first one
    all_substrings.sort(key=lambda x: len(x), 
                        reverse=True)
    return all_substrings[0]

Tests (taken from @salparadise's answer):

assert 'abcdefghi' == longest_sorted_substr('abcdezdflkjabcdefghiasldfjlkasdfjkaaabb')
assert 'abcd' == longest_sorted_substr('jamabcdaskl')
assert 'abcde' == longest_sorted_substr('aldksfjklasabcdedjfkladabcd')
assert 'abcdefghijklmnopqrst' == longest_sorted_substr('adfadsabcdefghijklmnopqrst')
assert 'cghijklmnopqrst' == longest_sorted_substr('adfadscghijklmnopqrst')

I haven't compared the performance against the other proposed answers and keep in mind that it is case sensitive and expects the string to consist of characters.

If necessary, you could first extract only the letters from the string and convert it to lowercase by using:

import string
s = ''.join([letter for letter in s.lower() if letter in string.ascii_letters])
DocZerø
  • 8,037
  • 11
  • 38
  • 66
0

I Tweaked you code a little, and tried it. It seems to work perfectly. I am still a learner, don't mind my minor mistakes if there are any. Add your string as s='....'

x = 'abcdefghijklmnopqrstuvwxyz'
current_string = ''
count=0
longest_string = ''
for p in range(len(s)-1):

    if x.index(s[p+1])-x.index(s[p]) < 0 and count == 0:
       longest_string = s[count]
       p +=1
    elif x.index(s[p+1])-x.index(s[p]) < 0:
       current_string = ''

    elif x.index(s[p+1])-x.index(s[p]) >= 0:
       current_string += s[p]
       count += 1
       if len (longest_string) < len (current_string + s[p+1]):
           longest_string = current_string + s[p+1]


print('Longest substring in alphabetical order is:', longest_string)