This is a question arising from a Coursera course on genomics that I'm following. We are currently dealing with search methods, specifically how to detect (align) all occurrences of a shorter DNA sequence (p, a string) within a longer DNA sequence (t, also a string). The approach in question involves the creation of a hash table for t composed of all possible substrings of length k AND their positions (each substring is a so-called kmer). We then search the hash table with substrings of the same length (kmers) derived from p. This involves the creation of an Index class and the eventual use of Python's bisect_left function. Here's the code:
import bisect
import sys
class Index(object):
def __init__(self, t, k):
''' Create index from all substrings of size 'length' '''
self.k = k # k-mer length (k)
self.index = []
for i in range(len(t) - k + 1): # for each k-mer
self.index.append((t[i:i+k], i)) # append (k-mer, offset) pair (tuple)
self.index.sort() # alphabetize by k-mer
def query(self, p):
''' Return index hits for first k-mer of P '''
kmer = p[:self.k] # query with first k-mer
i = bisect.bisect_left(self.index, (kmer, -1)) # binary search
hits = []
while i < len(self.index): # collect matching index entries
if self.index[i][0] != kmer:
break
hits.append(self.index[i][1])
i += 1
return hits
My question concerns the format of the bisect_left function. Python docs gives the syntax for this function as follows:
bisect.bisect_left(a, x, lo=0, hi=len(a), *, key=None)
The bisect_left function used in my course is formatted in this way:
i = bisect.bisect_left(self.index, (kmer, -1))
I understand that the parameter a has been passed the argument self.index, which is the hash table created from t. But then I'm flummoxed by the tuple which follows. I don't see how any of the following parameters accepts a tuple. I can imagine that the hash table is organized as tuples of the form (kmer, position), but none of these entries will have the position -1. The instructor states that he sets the value to -1 for a specific purpose as follows:
"So the first step is to just find the first k basis of p to look them up in the tables. So, I'm going to say kmer equals p up to length k, and we need to do self.k. And now I want to find the first position in the list where this kmer occurs, so I'm going to use bisect to do that quickly. And say i = bisect.bisect_left. And I pass in the list that we're searching for, and then the object that we're searching for. And remember this is a list of tuples so I'm going to pass in the kmer, and then I'm going to put -1. And since all the indices in the list are greater than negative one, this assures that we always get the first occurrence of that."
I mean, I understand the English, but I don't see how bisect can (a) take in a tuple as an argument, and (b) assuming it can (obviously it can), how can it make use of a position, -1, that doesn't exist in the hash table? I have a hunch I'm not cluing in to what's going on in terms of sorting, but I can't put my finger on it.
Many thanks, and sorry for what must be the overly long statement of my question.