6

Given the target ('b', 'a') and the inputs:

x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')

The aim is to find the location of the contiguous ('b', 'a') pair and get the output:

>>> find_ba(x0)
0
>>> find_ba(x1)
0
>>> find_ba(x2)
None
>>> find_ba(x3)
1

Using the pairwise recipe:

from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

I could do this to get the desired output:

def find_ba(x, target=('b', 'a')):
    try:
        return next(i for i, pair in enumerate(pairwise(x)) if pair == target)
    except StopIteration:
        return None

But that would require me to loop through all pairs of characters until I find the first instance. Is there a way to find the index of a pairwise element without looping through all the characters?


Answering @MatthiasFripp's question in the comments:

Are your elements in lists or tuples (as shown) or in a generator (e.g. reading from a file handle)?

The x* are all tuples of strings, so they can be accessed by index. But if the answer/solution can work for both tuples and generators, that would be great!

Can you say about how many lists you have to search and about how long they are? That would help for suggesting a search strategy.

The lengths of the tuples are not fixed. They can be of size > 2.

alvas
  • Are all your inputs of length 4? – Blender Apr 26 '17 at 09:17
  • 1
    if ab in "".join(x0) ? – Liam Apr 26 '17 at 09:19
  • @Liam There's a hidden loop there. – Maroun Apr 26 '17 at 09:19
  • 6
    What do you mean, "without looping all the elements"? Surely you have to look at each element at least once in order to determine that the tuple is not in the list. (You might argue that it is enough to look at every 2nd element, but then you'd have to compare that to _both_ elements in the tuple, wouldn't you?) – tobias_k Apr 26 '17 at 09:21
  • @Blender Nope the input can be arbitrarily sized. – alvas Apr 26 '17 at 09:23
  • 1
    @alvas I don't think it's possible even if you use a dict to access a pair in constant time. Still you need to loop through pairs once to construct. – Ozgur Vatansever Apr 26 '17 at 09:34
  • 2
    You don't need a try-except clause, `return next((i for i, pair in enumerate(pairwise(x)) if pair == target), None)` – Chris_Rands Apr 26 '17 at 09:34
  • Do you mean you don't want to loop both, i.e. you don't like the `tee`? You could loop checking every other character, and, if it matches one of `('b', 'a')` check the previous or next for `'a'` or `'b'`. – Peter Wood Apr 26 '17 at 09:44
  • 1
    @OzgurVatansever: that assumes that the OP is using an older version of Python. The recipe's the one from python 3 and earlier questions from the OP make it seem like he's up to date. – DSM Apr 26 '17 at 09:47
  • Why don't you want to loop? – Peter Wood Apr 26 '17 at 10:05
  • Cos the input could be neurotically long =) – alvas Apr 26 '17 at 10:08
  • 2
    @alvas Conceptually can you explain in your question how you expect this to be possible without explicit or implicit looping? – Chris_Rands Apr 26 '17 at 10:26
  • @alvas I'm not sure how you expect to find something without looking. You could build a lookup table but you'd have to loop to build it. If the data were sorted you could do a binary search. – Peter Wood Apr 26 '17 at 13:40
  • Are your elements in lists or tuples (as shown) or in a generator (e.g. reading from a file handle)? Can you say about how many lists you have to search and about how long they are? That would help for suggesting a search strategy. – Matthias Fripp May 03 '17 at 23:19
  • For each input, are you only interested in the first occurrence of `("b", "a")` or all occurrences? – pylang May 05 '17 at 22:50
  • Yes, the first occurrence. – alvas May 06 '17 at 01:56
  • @alvas at least comment so we get an idea of what's wrong with what we answered!! – DexJ May 08 '17 at 05:24
  • @DexJ I'm not the one who downvoted the answers, I have not given any downvotes for this question ;P – alvas May 08 '17 at 06:48
  • oh! OK, then it's weird because all the answers are downvoted without any explanation or comment – DexJ May 08 '17 at 09:39
  • Whoever downvoted shouldn't hide in the shadow.... I'm going to upvote all questions by +1 to be fair... – alvas May 09 '17 at 00:52
  • You might be interested in this article on string searching; similar issues apply to your search. http://old.blog.phusion.nl/2010/12/06/efficient-substring-searching/ – Matthias Fripp May 09 '17 at 08:26
  • @alvas are you searching the same inputs multiple times with different search terms? In that case there are various ways you could do some advance prep and then get nearly instantaneous searches. – Matthias Fripp May 09 '17 at 08:52

15 Answers

13

The fastest general search algorithm has O(n) average performance (it's called linear search); that means you have no alternative (except maybe for a constant factor) but to process each element.

Given your question:

Is there a way to find the index of a pairwise element without looping through all the characters?

That's possible (it's still O(n) though) by only looking at each second item:

from itertools import count

def find_ab(tup):
    for idx in count(start=1, step=2):
        try:
            if tup[idx] == 'b':
                if tup[idx+1] == 'a':
                    return idx
            elif tup[idx] == 'a':
                if tup[idx-1] == 'b':
                    return idx-1
        except IndexError:
            break

In the worst case it will still compare all items but it will skip one item for every odd-indexed item that isn't 'b' or 'a'.

That's a bit like cheating so let me explain why common alternatives aren't possible in your case:

Binary search

Binary search only needs to compare log(n) items, but it requires the sequence to be sorted. Your examples aren't sorted, so sorting them would require O(n*log(n)) operations - which wouldn't just process each item once, it would process some of them several times. Not that I know of a sensible way to sort by adjacent pairs anyway.

Bucket search (or hashtables)

You have tuples so creating a hashtable (a dict) doesn't make sense because in order to create that structure you would need to process each element.

But if you plan to do several of these searches for pairs you could create the dictionary once (O(n)) and do many searches afterwards in O(1):

d = {}
for idx, pair in enumerate(pairwise(x0)):
    if pair not in d:    # keep only the first index for each pair
        d[pair] = idx

>>> d.get(('b', 'a'), None)
0

However that approach is far slower if you only want to search for one pair, because you lose the "short-circuit behaviour" (stops as soon as a match is found) and you process all elements when creating the dictionary.

Other approaches

Besides the general approaches:

  • O(n) linear search
  • O(log(n)) binary search (for sorted data)
  • O(1) lookups (for hashtable lookups or other search problems that only need to look inside one "bucket")

You can generally exploit any structure or knowledge about your data to reduce the amount of items you need to process. The problem is mostly that there are (probably) no already existing data structures for these, and homemade implementations often end up orders of magnitude slower than naive "process all elements" approaches. But if you have any meta-information about your sequences, you can exploit it.

Final remarks

The recipe for pairwise is actually quite nice, but you could also use iteration_utilities.successive¹. Last I checked it was roughly 1.5 to 2 times faster than the recipe. Even if you don't change the approach and accept that you need to process all (or almost all) elements in the worst case, it may be faster!
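
For illustration, a tiny usage sketch (assuming iteration_utilities is installed and that successive(iterable, 2) yields overlapping pairs the way the recipe does):

from iteration_utilities import successive

def find_ba(x, target=('b', 'a')):
    return next((i for i, pair in enumerate(successive(x, 2)) if pair == target), None)

print(find_ba(('z', 'b', 'a', 'a')))  # 1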

That data was probably generated. Maybe it's worthwhile to actually "search" for the elements during creation. That way you don't need to do the extra pass over the data at all. Or you could create the dict while creating the dataset (that allows for the O(1) lookups afterwards). Sometimes it's a good idea to look at the process that generated/downloaded/fetched the dataset if there is some way the information can be extracted.

Now, after writing all this text I need to state the obvious:

Your approach is really good. Even if it needs to process all elements in the worst case, it uses a perfect fit (the pairwise recipe) for the problem at hand, and it should actually work really fast even for long inputs. For a tuple containing 1 million 'z' characters it only needs 200ms on my computer, so you can process several million elements per second (even on old and slow computers like mine). That's probably not fast enough for big data, but then pure Python isn't a good language for processing big data (typically you would need to write a C extension, use Cython, or use a NumPy/Pandas-style approach). Also, next on a generator is lazy (assuming you use itertools.izip instead of zip on Python 2), so you only process the tuple up to the first match.
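
For reference, a rough way to reproduce a measurement like that on your own machine (numbers will vary; this uses the question's pairwise-based find_ba, written with next's default argument):

import timeit

setup = '''
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def find_ba(x, target=('b', 'a')):
    return next((i for i, pair in enumerate(pairwise(x)) if pair == target), None)

data = ('z',) * 10**6   # worst case: no match anywhere
'''

print(timeit.timeit("find_ba(data)", setup=setup, number=10) / 10, "seconds per call")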

Personally, I would simply use your original approach. Or if I had to find several pairs then I would just create the dictionary I mentioned earlier (maybe even serialize it) and do the lookups therein.


The bounty reason explicitly asks for "credible and/or official sources". Fortunately, search algorithms are well studied, so you can find explanations for each of the mentioned approaches in basic textbooks on algorithms.

There's also a small overview of the time complexities of Python's types in the Python wiki: "TimeComplexity". For lookups you have to check "Get Item" or "in".


¹ Disclosure: I'm the author of that 3rd-party library.

MSeifert
2

Not very impressive, but it works for your cases; check it out.

We simply extract the index of each matched item of the sample and check whether the indices are consecutive.

def consecutive_index(src,sample):
    result = None
    il = [src.index(a) for a in sample if a in src]
    if len(il) == len(sample) and len(range(il[0],il[-1]))==1:
        result = il[0]
    return result



x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')
sample = ('b', 'a')

##TEST your given combinations.
print consecutive_index(x0,sample) #expected 0
print consecutive_index(x1,sample) #expected 0
print consecutive_index(x2,sample) #expected None
print consecutive_index(x3,sample) #expected 1
DexJ
  • I haven't downvoted, but the `tuple.index` method will only return the first occurrence, so you wouldn't find a match if the first occurrence isn't a match: `consecutive_index(('a', 'b', 'z', 'b', 'a'), ('b', 'a'))` returns `None` (but there's a match at index 3). The question didn't mention any "structure"s in the tuples so I'm not sure if your approach is correct. Also the `if a in src` processes all elements of the `src` if `a` is not in `src`, so it's not exactly solving the part of the question "without looping all the characters". – MSeifert May 08 '17 at 10:43
1

How about using regular expressions, for example? Below you can find two functions: findPair returns values exactly like in your example, and findPairs looks for all non-overlapping occurrences and returns their starting positions in a list.

import re

# Function looks for all non-overlapping occurrences of pair (b, a) 
# and returns a list containing their starting positions
def findPairs(x, b, a):
    x = str().join(x)
    y = str().join([str(b), str(a)])
    return [m.start() for m in re.finditer(y, x)]

# Function looks for first occurrence of the pair (b, a) 
# and returns starting position if there was a match 
# or None when the match was not found
def findPair(x, b, a):
    x = str().join(x)
    y = str().join([str(b), str(a)])
    try:
        return re.search(y, x).start()
    except AttributeError:
        return None


if __name__ == "__main__":
    # first occurrence
    x0 = ('b', 'a', 'z', 'z')
    x1 = ('b', 'a', 'z', 'z')
    x2 = ('z', 'z', 'a', 'a')
    x3 = ('z', 'b', 'a', 'a')

    outx0 = findPair(x0, 'b', 'a')  # 0
    outx1 = findPair(x1, 'b', 'a')  # 0
    outx2 = findPair(x2, 'b', 'a')  # None
    outx3 = findPair(x3, 'b', 'a')  # 1

    # multiple occurrences:
    x4 = ('z', 'b', 'a', 'a', 'z', 'b', 'a', 'a')
    outx4 = findPairs(x4, 'b', 'a')  # [1, 5]

EDIT:

If you don't want or don't like regular expressions, and you are interested only in the first occurrence, you can simply use the find() method and define the function for finding pairs as:

def findPairNoRe(x, b, a):
    y = str().join([str(b), str(a)])
    res = str().join(x).find(y)
    if res == -1:
        return None
    else:
        return res
machnic
  • 2
    `str.join()`, `str.find()` and regex will still iterate. Converting to a string *first* is a lot of work for something the OP had already solved reasonably efficiently. – Martijn Pieters May 04 '17 at 13:26
1

There are shorter formulas for this, but no way to avoid looping entirely. However, you could speed it up via multiprocessing (see end). First, here are some search methods (all O(n)) with various mixes of speed and simplicity.

Fairly simple, fast code to use if values are in tuples or lists:

def find_ba(tup, target):
    last_check = len(tup)-len(target)
    for i, c in enumerate(tup):
        # note: the test below only uses c 95% of the time, 
        # which makes it pretty fast
        if c == target[0] and i <= last_check and tup[i:i+len(target)] == target:
            return i
    return None

Not so simple, but faster, inspired by @MSeifert, but optimized for longer targets:

def find_ba(tup, target):
    from itertools import count
    search = set(target)
    target_len = len(target)
    for i in count(start=1, step=target_len):
        try:
            if tup[i] in search:  # O(1) membership test
                # tup[i] matches some character of the target; try every
                # alignment that would place that character at position i,
                # starting with the earliest possible start index
                c = tup[i]
                for j in range(target_len - 1, -1, -1):
                    if target[j] == c and i >= j and tup[i-j:i-j+target_len] == target:
                        return i - j
        except IndexError:
            break
    return None

Since you're already going to the trouble to construct tuples of characters, you could construct strings instead, and then let Python do the optimization in native C code:

def find_ba(x, target):
    # assuming x and target are both strings
    pos = x.find(target)
    return pos if pos >= 0 else None

(Although really, you're probably better off doing the search while you create the tuples or strings, if that's possible.)

If the values are in generators, then this will work (pretty similar to what you have already). This will be more efficient than creating long tuples and searching them if the underlying source is slow (e.g., reading items from a disk):

import itertools
def find_ba(lst, target):
    a, b = itertools.tee(lst)
    next(b, None)
    for i, pair in enumerate(zip(a, b)):
        if pair == target:
            return i
    return None

Note: on Python 2.7, use itertools.izip instead of zip.

The main way to speed this up would be to use the multiprocessing library. If you have a large number of inputs to process, you could use multiprocessing.Pool.map to send each one to a different worker, in round-robin style. If you have only a few inputs and each is very long, then you might want to use itertools.islice to divide them into longish chunks, then send each chunk to multiprocessing.Pool.map until you get a hit; then you could begin processing the next input. I can't tell from your question which of these approaches would be most effective.
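
For illustration, a minimal Pool.map sketch (the input list and the simple linear-scan find_ba below are made up for the example; the function must be defined at module level so the worker processes can import it):

from functools import partial
from multiprocessing import Pool

def find_ba(tup, target=('b', 'a')):
    # plain linear scan, run independently by each worker
    for i in range(len(tup) - len(target) + 1):
        if tup[i:i + len(target)] == target:
            return i
    return None

if __name__ == '__main__':
    inputs = [('b', 'a', 'z', 'z'), ('z', 'z', 'a', 'a'), ('z', 'b', 'a', 'a')]
    with Pool() as pool:
        # each input tuple is searched in a separate worker process
        results = pool.map(partial(find_ba, target=('b', 'a')), inputs)
    print(results)  # [0, None, 1]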

Matthias Fripp
0

The answer to the question is: no, there is no way to find the pair without looping over the characters, because if you don't look at a character, you have no idea whether it matches one of your pair or not.

You may be able to hide the iteration by making it implicit in a language or library routine, but it's got to be there. Making it implicit might make the code more efficient (if, for example, you move the loop out of the Python interpreter and into a pre-compiled language such as C). Or, it might not.

An (inefficient, silly!) example of hiding stuff might be

def find_ba(x, target=('b', 'a'), separator='|'):
    t = separator.join(target)
    try:
        # integer division so the result is a position in x, not in the joined string
        return separator.join([c for c in x]).index(t) // 2
    except ValueError:
        return None

(Code supplied to the Ministry of Silly walks under contract number SW/l10O/Il0O/01L1lO00/22 and placed in the public domain).

nigel222
0

Using itertools you can make it lazy, but you still need to iterate:

import itertools
def check(x, target):
    for t in itertools.izip(x, itertools.islice(x, 1, len(x))):
        if t == target:
            return True
    return False
check(x0, ('b', 'a'))
True

EDIT: use zip in Python 3

Netwave
0

As nigel222 pointed out, there's no way (in the worst case) to avoid iterating over the entire list, since you have to make an exhaustive comparison to be sure the item you want isn't contained in your iterable.

If you're going to be making a lot of these queries on various possible sub-sequences, though, it could be worthwhile to put the pairs into a set, since sets have O(1) lookup.

...
my_pairwise = set(pairwise(x))
found_subsequences = [subsequence
                      for subsequence in collection_of_subsequences
                      if subsequence in my_pairwise]

This way, the O(n) iteration through your x only occurs once, and each lookup afterwards is O(1).
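
A self-contained version of the same idea, with made-up example data:

from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

x = ('z', 'b', 'a', 'a')
my_pairwise = set(pairwise(x))                    # built once, O(n)

queries = [('b', 'a'), ('a', 'z'), ('a', 'a')]    # hypothetical subsequences to test
found = [q for q in queries if q in my_pairwise]  # each membership check is O(1)
print(found)  # [('b', 'a'), ('a', 'a')]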

0

It's not practical, but it solves your case:

def look_up(needle, haystack):
    i = ''.join(haystack).find(''.join(needle))
    return i if i > -1 else None

So assuming that we have this:

x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')
ba = ('b', 'a')

we are getting this:

print(look_up(ba, x0)) # Prints: 0
print(look_up(ba, x1)) # Prints: 0
print(look_up(ba, x2)) # Prints: None
print(look_up(ba, x3)) # Prints: 1

And here's for multiple occurrences:

def look_up_multiple(needle, haystack):
    needle_str = ''.join(needle)
    haystack_str = ''.join(haystack)
    indexes = []
    i = 0
    while i < len(haystack_str):
        i = haystack_str.find(needle_str, i)
        if i == -1:   # no further occurrence; without this check the loop would never end
            break
        indexes.append(i)
        i += len(needle_str)   # continue after this (non-overlapping) match
    return indexes

And let's run it:

x = ('b', 'a', 'z', 'z', 'b', 'a')
ba = ('b', 'a')

print(look_up_multiple(ba, x)) # Prints: [0, 4]
Tim S.
0

You can do it by converting the tuples to strings.

def findba(x,target):
    x1 = "".join(x) 
    target1 = "".join(target)
    if target1 in x1:
        return x1.index(target1)
    else:
        return None

ab = ('b','a')
x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')

print findba(x0,ab)
print findba(x1,ab)
print findba(x2,ab)
print findba(x3,ab)
Somil
0

As has already been pointed out, you cannot avoid looping over all the characters. You can make it lazy and iterate only once over the input tuple as follows (assuming Python 3):

from itertools import islice, tee

def find_ba(x):
    pairs = zip(*(islice(g, i, None) for i, g in enumerate(tee(x, 2))))
    return next(
        (i for i, pair in enumerate(pairs) if pair == ('b', 'a')),
        None)
Alicia Garcia-Raboso
  • The slice `x[1:]` is not lazy; it requires `x` to be a list or tuple. It won't work if `x` is an iterator whose values are too numerous to fit in memory all at once. The `pairwise` recipe used by the questioner is better. – Blckknght May 05 '17 at 18:36
0

This solution looks for the first element of the target using the index method of the sequence. It then checks whether the next item in the sequence matches the second item of the target. If not, it finds the next occurrence of 'b' and checks the following item again. Wash, rinse, repeat.

This does not loop over all pairs, but looks for the first item in the expected pair, then checks the next item.

def find_ba(x, target=('b','a')):
    try:
        ind = 0
        while ind < len(x):
            ind += x[ind:].index(target[0])
            if x[ind+1] == target[1]:
                return ind
            ind += 1
    except (ValueError, IndexError):  # IndexError: target[0] found at the very end of x
        return None

Test:

# 100 random letters
letters = ['f', 'y', 'h', 'u', 't', 'l', 'y', 'u', 'm', 'z', 'a', 'a',
           'i', 't', 'g', 'm', 'b', 'l', 'z', 'q', 'g', 'f', 'f', 'b', 
           'b', 'a', 'c', 'z', 'n', 'j', 'v', 'b', 'k', 'j', 'y', 'm', 
           'm', 'f', 'z', 'x', 'f', 'q', 'w', 'h', 'p', 'x', 't', 'n', 
           'm', 'd', 'z', 'q', 'v', 'h', 'b', 'f', 'q', 'd', 'b', 's', 
           'a', 't', 'j', 'm', 'h', 'r', 'd', 'n', 'e', 'k', 'y', 'z', 
           'd', 'e', 'x', 'h', 'r', 'z', 'b', 'n', 'q', 'v', 't', 'q', 
           'f', 'w', 'b', 'w', 'f', 'c', 'f', 'h', 'q', 'o', 'r', 'f', 
           'w', 'w', 'n', 'v']
find_ba(letters)  # 24

Method using zip for comparison:

def find_ba1(x):
    try:
        return [(i,j) for i,j in zip(x[:-1], x[1:])].index(('b', 'a'))
    except ValueError:
        return None

And a little speed test:

%timeit find_ba(letters)
100000 loops, best of 3: 2.31 µs per loop

%timeit find_ba1(letters)
100000 loops, best of 3: 8.4 µs per loop
Brad Campbell
0

Without any promises about the nature of the data (i.e. presuming it is random), the search cannot be better than O(n). At best you may be able to reduce the number of operations by a constant factor by exploiting specific information about what you are trying to do: the size of the target, repeated characters in the target (searching for 'b', 'b', 'a' we could look at every other character and know it must be a 'b' to match our sequence, then look at the surrounding characters), or any other information we can obtain from a quick analysis of the smaller data set (again, assuming the sequence list is an unknown quantity).

One thing I looked into, for example, was searching for the target by iterating in steps of the target's length and checking whether the probed element is one of the characters we are searching for. The problem, of course, is that instead of visiting every index in the list (we now touch len(list)/len(target) elements) we perform more operations on each element we touch (in other words, for 'b', 'a' we probe every second element, but we look for two things). This does nothing to decrease the number of operations; however, it would significantly decrease the number of elements you have to load from a secondary memory store, assuming you plan to look for targets in rather large sequences and that this is why you are avoiding looping through every element. A rough sketch of this stride idea appears at the end of this answer.

There are also a number of ways you could use parallelism to increase the efficiency of your search, if efficiency is your sole goal. (Just remember to use multiprocessing and not threading if you choose this route, since Python's threading module only supports concurrency, not parallelism, because the interpreter bottlenecks the threads.)

As a conclusion, and to directly answer the question you asked: yes, it is entirely possible to find the index of pairwise elements without looking at every element in the sequence. However, doing so requires you to first use information specific to the problem at hand and then apply it to the search. I think the best method is to analyze the data first and then perform the kind of search that best fits that input. In other words, if there are repeats, you can use that; if there aren't, you can fall back on another search.
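
A rough sketch of the stride idea mentioned above (the function name and details are illustrative, not taken from the question):

def stride_find(seq, target=('b', 'a')):
    tl = len(target)
    members = set(target)  # O(1) membership test for probed characters
    # probe every tl-th element; every window of length tl contains exactly one probe
    for probe in range(tl - 1, len(seq), tl):
        if seq[probe] in members:
            # a match containing this probe must start at most tl-1 positions earlier
            for start in range(max(0, probe - tl + 1), min(probe, len(seq) - tl) + 1):
                if tuple(seq[start:start + tl]) == tuple(target):
                    return start
    return None

print(stride_find(('z', 'z', 'b', 'a')))  # 2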

zlmonroe
0

Solution:

You can use numpy.where to locate the sequence after building a paired-sequence array.

import numpy as np

# np.roll(x1, -1) shifts the array left by one element; np.core.defchararray.add
# concatenates element-wise to build the paired sequence.
np.where(np.core.defchararray.add(x1, np.roll(x1, -1)) == 'ba')[0]

Test

for x in [x0,x1,x2,x3]:
    print (np.where(np.core.defchararray.add(x,np.roll(x,-1)) == 'ba'))[0]

[0]
[0]
[]
[1]
Allen Qin
  • 2
    This approach processes each element at least 6 times (twice for `roll` because it's first converted to an array and then rolled - then both arguments for `add` are converted to chararrays, then you do the equality check and finally once for `where`) and it's not short-circuiting. Also `roll` has the ugly side effect of wrapping the elements around, so you could get an index if the first element is `a` and the last is `b` and every item in between is `z`: `('a', 'z', 'z', 'b')`. – MSeifert May 08 '17 at 08:20
0

I tried to benchmark MSeifert's method and mine. My code was derived from MSeifert's but tries to go one step further, i.e., jumping to the next occurrence of the target's first character instead of walking two steps at a time. By the way, mine is usually faster and does not need any package. Please let me know if anyone has any questions or comments. Thank you.

05/09/2017 edit:
To respond to @Matthias Fripp's comment, I added test tuples of 10k and 100k elements. Mine is still faster for 10k elements but not for 100k elements, so my code is not optimal. I think my method is not the "right" answer either, as @MSeifert pointed out, because the original question asked about ways to avoid searching through all elements.

import random # to generate data
# Set up data
x0 = ('b', 'a', 'z', 'z')
x1 = ('b', 'a', 'z', 'z')
x2 = ('z', 'z', 'a', 'a')
x3 = ('z', 'b', 'a', 'a')
x4 = tuple([random.choice(x3) for i in xrange(10000)])
x5 = tuple([random.choice(x3) for i in xrange(100000)])
target = ('b', 'a')  # search target used in the timings below

# Set up functions
# My code
def findPairwise(x, target):
    currentX = x
    cumulatedIdx = 0
    while True:
        try:
            idx = currentX.index(target[0])
            try:
                if currentX[idx+1] == target[1]:
                    return idx + cumulatedIdx
            except IndexError:      # target[0] was the last element
                pass
        except ValueError:          # target[0] no longer present
            break
        # advance just past the character we tried; skipping by idx+2 could
        # miss a match such as ('b', 'b', 'a')
        currentX = currentX[idx+1:]
        cumulatedIdx += idx+1

# MSeifert's method
from itertools import count
def find_ab(tup,target):
    for idx in count(start=1, step=2):
        try:
            if tup[idx] == target[0]:
                if tup[idx+1] == target[1]:
                    return idx
            elif tup[idx] == target[1]:
                if tup[idx-1] == target[0]:
                    return idx-1
        except IndexError:
            break

Result

In [109]: %timeit findPairwise(x0,target)
The slowest run took 8.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.27 µs per loop

In [110]: %timeit find_ab(x0,target)
The slowest run took 5.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.04 µs per loop

In [111]: %timeit findPairwise(x1,target)
The slowest run took 4.75 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.46 µs per loop

In [112]: %timeit find_ab(x1,target)
The slowest run took 5.04 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.99 µs per loop

In [113]: %timeit findPairwise(x2,target)
The slowest run took 4.66 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.56 µs per loop

In [114]: %timeit find_ab(x2,target)
The slowest run took 5.89 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.25 µs per loop

In [115]: %timeit findPairwise(x3,target)
The slowest run took 8.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.28 µs per loop

In [116]: %timeit find_ab(x3,target)
The slowest run took 6.66 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.65 µs per loop

In [151]: %timeit findPairwise(x4,target)
The slowest run took 5.46 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.27 µs per loop

In [152]: %timeit find_ab(x4,target)
The slowest run took 6.21 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.92 µs per loop

In [153]: %timeit findPairwise(x5,target)
1000 loops, best of 3: 325 µs per loop

In [154]: %timeit find_ab(x5,target)
The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.45 µs per loop
  • 2
    Have you tested how `findPairwise()` performs with very long `x` tuples as specified by @alvas? I believe your method makes a complete copy of the remainder of `currentX` at each step, which should have something like O(n^2) performance with long `x` tuples. – Matthias Fripp May 09 '17 at 07:52
  • 1
    It may be faster (when you **don't** hit your `O(n**2)` worst case) but it still has to process each item. The iteration is done implicit in `.index(target[0])` and probably faster than a python-loop but it's still there. My solution wasn't exactly made for speed, but the question wasn't about performance it just asked "how one can avoid looping **all the characters**". Nevertheless a very creative answer! (and I did not downvote you) – MSeifert May 09 '17 at 08:43
  • @MSeifert Thanks for your comments. I think you are right: `.index()` still searches through the characters. I didn't think of that before. – Chih-Hsu Jack Lin May 10 '17 at 00:07
  • @MatthiasFripp Thanks for your comments. I agree the slicing might be what's slowing it down. It's unclear to me whether the operation of list slicing (shallow copy?) is O(n^2). – Chih-Hsu Jack Lin May 10 '17 at 00:37
  • 1
    @Chih-HsuJackLin Slicing makes a shallow copy of every item in the slice; with short strings, that's pretty much equivalent to copying the items themselves. At any rate, taking a slice of a list is O(n). But you are creating a new slice at every step of your algorithm, and the number of steps in your algorithm is O(n). So the total amount of copying is O(n^2). – Matthias Fripp May 10 '17 at 01:28
  • I see. Thank you very much for your explanation. – Chih-Hsu Jack Lin May 10 '17 at 01:34
0

If you are repeatedly searching for different targets in the same inputs, you could avoid looping through the inputs each time by creating a hash of the location of all unique strings, like the code below. This requires one loop through each input for the initial setup, but then the searches are nearly instantaneous (no looping).

# store first occurrence of each unique 2-char string (O(n))
x1_first = dict()
target_len = 2
for i in range(len(x1)):
    x1_first.setdefault(x1[i:i+target_len], i)

# find first occurrence of a particular string without looping (O(1))
print x1_first.get(('b', 'a'), None)   # 0 for the x1 above

Note: this is very similar to one of @MSeifert's answers, but shows how to handle arbitrary target lengths. If you have multiple target lengths to worry about, then you would need to create separate dicts for each length, which will be inefficient for storage. In that case, you'll probably do better to create a sorted list of the longest possible targets (e.g. 10 chars) and then search it using bisection (see bisect module). For shorter substrings you would need to scan through the multiple matches and pull out the earliest one.
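
A rough sketch of the sorted-list-plus-bisect idea (the window length and the data below are made up for illustration):

from bisect import bisect_left

x = ('z', 'b', 'a', 'a', 'z', 'b', 'a')
window = 3  # longest target length we expect to search for

# one-time O(n log n) setup: every length-`window` slice paired with its start index, sorted
windows = sorted((x[i:i + window], i) for i in range(len(x) - window + 1))

def find_prefix(target):
    """Earliest start index of any window beginning with `target`, else None."""
    target = tuple(target)
    pos = bisect_left(windows, (target,))
    best = None
    # all windows sharing this prefix form a contiguous run in the sorted list
    while pos < len(windows) and windows[pos][0][:len(target)] == target:
        idx = windows[pos][1]
        best = idx if best is None else min(best, idx)
        pos += 1
    return best

print(find_prefix(('b', 'a')))  # 1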

Matthias Fripp
  • You know that this is just a slight variation of the approach I already included in my answer? Also use `x1_first.get` with default, otherwise you'll get `KeyError`s when the "particular string" isn't in the dictionary, i.e. for `x2` (like I did in my answer). :) – MSeifert May 09 '17 at 09:30
  • @MSeifert, Sorry, I should have double-checked before posting. I just thought of it when I realized these searches might be repeated, but I forgot you had one like this already. I'll leave it here because it shows how to handle different lengths of target, otherwise I'd drop it. – Matthias Fripp May 09 '17 at 16:42