linear searching with two letters python

Question

I have a function that should use a linear search to return a list of every single character in 'corpus' that immediately follows 'last' (including duplicates). The characters should be in the same order as they appear in the corpus.

Example:

    filter_possible_chars('lazy languid line', 'la')
    ['z', 'n']
    filter_possible_chars('pitter patter batton', 'tt')
    ['e', 'e', 'o']
    filter_possible_chars('pitter pattor batt', 'tt')
    ['e', 'o']

But my program runs into a problem with the third example: after the third tt, in the word batt, there is nothing left in the corpus, so it obviously shouldn't put anything else in the list, yet I get "IndexError: list index out of range". Why?

This is the function:

def filter_possible_chars(corpus, last):

    listo = []
    last_list = []
    final = []

    for thing in corpus:
        listo.append(thing)
    for last_word in last:
        last_list.append(last_word)


    for index, letter in enumerate(listo):

        if letter == last_list[0]:
            if listo[index+1] == last_list[1]:
                final.append(listo[index+2])
    print(final)

score 0 · Answer 1 · answered Jan 10 '21 at 02:33

You seem to have identified the issue: you are sometimes trying to access a list element with an index beyond the end of the list, either here: final.append(listo[index+2]) or here: listo[index+1].

You can define a helper function that first checks that the access will succeed.

def get(_list, index):
    # Return the element only when the index is within bounds; otherwise None.
    if 0 <= index < len(_list):
        return _list[index]
    return None

my_list = [1, 2, 3]
value = get(my_list, 2) # 3
value = get(my_list, 4) # None
if value is not None:
    pass  # do stuff
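
As a rough sketch of how such a helper could be used in the question's function: the name safe_get and the decision to return the list instead of printing it are my assumptions, and last is assumed to be exactly two characters.

```python
def safe_get(items, index):
    """Return items[index], or None when index is out of range."""
    if 0 <= index < len(items):
        return items[index]
    return None


def filter_possible_chars(corpus, last):
    final = []
    for index, letter in enumerate(corpus):
        # Both lookups go through safe_get, so running off the end yields None.
        if letter == last[0] and safe_get(corpus, index + 1) == last[1]:
            follower = safe_get(corpus, index + 2)
            if follower is not None:
                final.append(follower)
    return final


print(filter_possible_chars('pitter pattor batt', 'tt'))  # ['e', 'o']
```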

score 0 · Answer 2 · answered Jan 10 '21 at 02:37

The problem you are running into is that 'tt' is at the end of your third string. When the code tries to find the letter after it, it increases the index, but the string has already reached its end, so you end up asking for a character that does not exist.

First, if you wanted it to wrap around and return the first character of the string in this case, use the modulus operator to bring the index back to zero when it goes past the end:

def filter_possible_chars(corpus, last):

    listo = []
    last_list = []
    final = []

    for thing in corpus:
        listo.append(thing)
    for last_word in last:
        last_list.append(last_word)


    for index, letter in enumerate(listo):

        if letter == last_list[0]:
            if listo[(index+1)%len(corpus)] == last_list[1]:
                final.append(listo[(index+2)%len(corpus)])
    print(final)

Or, if you wanted it to add nothing in this situation, you could add an if statement to detect whether the index is at its limit and, if it is, do nothing and skip to the end of the function using return.
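
A minimal sketch of that second option, assuming last is exactly two characters. It uses break to leave the loop rather than return, so the matches collected so far are still returned; that choice, and returning the list instead of printing it, are my own.

```python
def filter_possible_chars(corpus, last):
    final = []
    for index, letter in enumerate(corpus):
        # Once fewer than three characters remain, there is no room for
        # the two-letter match plus a following character, so stop here.
        if index + 2 >= len(corpus):
            break
        if letter == last[0] and corpus[index + 1] == last[1]:
            final.append(corpus[index + 2])
    return final


print(filter_possible_chars('pitter pattor batt', 'tt'))  # ['e', 'o']
```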

I think using the modulo operator here is incorrect because while it does prevent an out-of-bounds error, it can now result in false positives. e.g. "abcdefg"[10%7] will sometimes give you a letter that might match your query when it never should. You need to detect and handle the out-of-bounds case, not mask it. — DragonBobZ, Jan 10 '21 at 02:55
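
To make the false positive concrete, here is a small sketch of the wrap-around behaviour. filter_with_wraparound is a hypothetical name; the logic is meant to mirror the modulo version above, except that it returns the list instead of printing it.

```python
def filter_with_wraparound(corpus, last):
    final = []
    size = len(corpus)
    for index, letter in enumerate(corpus):
        if letter == last[0] and corpus[(index + 1) % size] == last[1]:
            # (index + 2) % size wraps to the start when the match ends the string.
            final.append(corpus[(index + 2) % size])
    return final


# The trailing 'tt' in 'batt' wraps around and picks up the leading 'p'.
print(filter_with_wraparound('pitter pattor batt', 'tt'))  # ['e', 'o', 'p']
```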

score 0 · Answer 3 · answered Jan 10 '21 at 02:48

Well, try this, it fixes the index problem:

import re

query_list = [
    ['lazy languid line', 'la'],
    ['pitter patter batton', 'tt'],
    ['pitter pattor batt', 'tt']
]


def search(query):
    query_string = query[0]
    query_key = query[1]
    result = []
    # re.escape makes the key match literally, even if it contains regex metacharacters
    for match in re.finditer(re.escape(query_key), query_string):
        if match.end() < len(query_string):
            # the character right after the match
            result.append(query_string[match.end()])
        else:
            # the match sits at the very end of the string
            result.append(None)
    return result

for query in query_list:
    result = search(query)
    print(query)
    print(result)

Output:

['lazy languid line', 'la']
['z', 'n']
['pitter patter batton', 'tt']
['e', 'e', 'o']
['pitter pattor batt', 'tt']
['e', 'o', None]
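
One caveat: re.finditer only reports non-overlapping matches, so a key such as 'aa' is found twice in 'aaaa', not three times. If overlapping occurrences matter, one possible variation is to wrap the key in a lookahead and capture the following character, which also drops the trailing None. This is a sketch of my own rather than part of the answer, and search_overlapping is a hypothetical name.

```python
import re


def search_overlapping(query_string, query_key):
    # (?=...) is a zero-width lookahead, so the scan advances one character at a
    # time and overlapping occurrences of the key are all visited.
    # The inner group captures the single character right after the key.
    pattern = re.compile('(?=' + re.escape(query_key) + '(.))', re.DOTALL)
    return [match.group(1) for match in pattern.finditer(query_string)]


print(search_overlapping('pitter pattor batt', 'tt'))  # ['e', 'o']
print(search_overlapping('aaaa', 'aa'))                # ['a', 'a']
```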

score 0 · Answer 4 · answered Jan 10 '21 at 03:25

"there is nothing after it so it obviously shouldnt put anything else in the list"

When the code comes to the second-last t, both if conditions are True and it tries to get listo[index+2], which does not exist, so Python raises IndexError: it has no way of knowing what you want it to return. Even if that were handled, the same thing would happen at the last t, when the code tries to get listo[index+1].

You can just stop searching at the third-last character:

def filter_possible_chars(corpus, last):
    result = []
    for i in range(len(corpus)-2):
        if corpus[i:i+2] == last:
            result.append(corpus[i+2])
    print(result)
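
The same idea extends to a last of any length. The following is only a sketch: it returns the list instead of printing it and treats an empty last as matching nothing, both of which are my assumptions rather than part of the question.

```python
def filter_possible_chars(corpus, last):
    n = len(last)
    if n == 0:
        return []  # assumption: an empty search string matches nothing
    # Stop n characters before the end so corpus[i + n] always exists.
    return [corpus[i + n]
            for i in range(len(corpus) - n)
            if corpus[i:i + n] == last]


print(filter_possible_chars('pitter pattor batt', 'tt'))  # ['e', 'o']
print(filter_possible_chars('pitter pattor batt', 'itt'))  # ['e']
```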

score 0 · Answer 5 · answered Jan 10 '21 at 03:44

You can do this using a list comprehension.

def filter_possible_chars(corpus, last):
    parts = [word.split(last) for word in corpus.split() if last in word]
    return [w[1][0] for w in parts if w[1]]

print (filter_possible_chars('lazy languid line', 'la'))
print (filter_possible_chars('pitter patter batton', 'tt'))
print (filter_possible_chars('pitter pattor batt', 'tt'))
print (filter_possible_chars('pitter pattor batt', 'it'))
print (filter_possible_chars('pitter pattor batt', 'er'))
print (filter_possible_chars('pitter pattor batt', 'ox'))

You can combine the two lines into one long list comprehension as follows:

return [word.split(last)[1][0] for word in corpus.split() if last in word and word.split(last)[1]]

Let me explain the code:

parts = [word.split(last) for word in corpus.split() if last in word]

Here I split the corpus into individual words using

for word in corpus.split()

After that, I check whether last is in the individual word.

If the substring last exists, I split the word again, using last as the separator. When last occurs once in the word, this gives two strings: the first is everything before last and the second is everything after it.

As an example, lazy is split into ['', 'zy'] for the substring la, whereas pitter is split into ['pi', 'er'] for tt.

Once you have this list, you need to pick the first character of the string at index 1.

For search 'la':

lazy languid line will result in [['', 'zy'], ['', 'nguid']]

For search 'tt':

pitter patter batton will result in [['pi', 'er'], ['pa', 'er'], ['ba', 'on']]

For search 'tt':

pitter pattor batt will result in [['pi', 'er'], ['pa', 'or'], ['ba', '']]

For search 'er':

pitter pattor batt will result in [['pitt', '']]

For search 'ox':

pitter pattor batt will result in []

This tells us that we can pick a result from every entry whose index 1 value is a non-empty string.

So the next list comprehension statement is:

return [w[1][0] for w in parts if w[1]]

Here, we take each entry from parts and check whether the string at index 1 is non-empty. If it is, we extract its character at position 0 and add it to the returned list.

The output of the following statements is:

print (filter_possible_chars('lazy languid line', 'la'))
print (filter_possible_chars('pitter patter batton', 'tt'))
print (filter_possible_chars('pitter pattor batt', 'tt'))
print (filter_possible_chars('pitter pattor batt', 'it'))
print (filter_possible_chars('pitter pattor batt', 'er'))
print (filter_possible_chars('pitter pattor batt', 'ox'))

['z', 'n']
['e', 'e', 'o']
['e', 'o']
['t']
[]
[]
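
Note that splitting on whitespace first means this version works word by word, so it is not guaranteed to agree with an index-based scan of the whole corpus. A small sketch comparing the two follows; split_based repeats the comprehension above, and index_based is a hypothetical reference implementation added here only for comparison.

```python
def split_based(corpus, last):
    parts = [word.split(last) for word in corpus.split() if last in word]
    return [w[1][0] for w in parts if w[1]]


def index_based(corpus, last):
    n = len(last)
    return [corpus[i + n] for i in range(len(corpus) - n) if corpus[i:i + n] == last]


# Same answer on the question's examples.
print(split_based('pitter pattor batt', 'tt'))   # ['e', 'o']
print(index_based('pitter pattor batt', 'tt'))   # ['e', 'o']

# A match at the end of a word: the next character in the corpus is a space.
print(split_based('batt on', 'tt'))              # []
print(index_based('batt on', 'tt'))              # [' ']

# Two matches inside one word: only the first is reported by split_based.
print(split_based('tattletattle', 'tt'))         # ['l']
print(index_based('tattletattle', 'tt'))         # ['l', 'l']
```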
