1

In a comparison function I am bascially looking for a pattern (e.g. "AAA") inside a long binary object (for an example, aaaAAAbbbBBB)

I'm working backwards through the file (I know the match will be closer to the end than beginning), an adding 1 byte to the variable that is being checked for the match:

1. aaaAAAbbbBB[B]
2. aaaAAAbbbB[BB]
3. aaaAAAbbb[BBB]
4. aaaAAAbb[bBBB]
5. ... 
n. aaa[AAAbbbBBB] 

match found, offset = -n

Given that I know my pattern is 3 elements long, I wondered if I can simply window the search variable rather than incrementing it - it gets very slow when the match is +1,000,000 elements deep in the list - windowed view of the same data would be:

1. aaaAAAbbb[BBB]
2. aaaAAAbb[bBB]B
3. aaaAAAb[bbB]BB
4. aaaAAA[bbb]BBB
5. ...
n. aaa[AAA]bbbBBB

match found, offset = -n

My current search looks like:

if marker in f_data[-counter:]:
    offset = (len(f_data)-counter)+len(marker)
    return offset

In MATLAB I would have used the array addressing to move through the array,(e.g. calling window = a[5:8], window = a[4:7] etc) but I don't think that's possible in Python (2.7)

I can see a few suggestions for using a sliding window, ( Rolling or sliding window iterator in Python - this looks like a close match) but I can't see how to implement it or they reference libs that I don't know how to use.

Is there a built in function for doing this?

Community
  • 1
  • 1
Jay Gattuso
  • 3,890
  • 12
  • 37
  • 51

3 Answers3

5

Why not just use rfind() or rindex()?

haystack = "aaaAAAbbbBBB"
needle   = "AAA"

pos = haystack.rfind(needle)

if pos >= 0:
    print "found at", pos - len(haystack)
else:
    print "not found"
kindall
  • 178,883
  • 35
  • 278
  • 309
  • Oh, that's useful, thank you. There might be n needles in the haystack, but I only need the last one - how would you cope with multiple matches? – Jay Gattuso Sep 26 '12 at 01:57
  • 2
    @JayGattuso `rfind` searches from the right. If you need the leftmost match, use `find`. – Marcin Sep 26 '12 at 02:02
0

Two things:

(1) the standard string type holds bytes, and you can use regexes with this. I could suggest that you slurp the object into a string, and perform a regex search.

(2) If you do want to do it the hard way, there's http://docs.python.org/library/itertools.html#itertools.groupby

Marcin
  • 48,559
  • 18
  • 128
  • 201
  • Thank you, I did think about RegEx but I have multiple match cases, and I only need the last one, at the time (before I know how deep the match would be) my approach seemed reasonable. I suspect I am going about things the wrong way!.. – Jay Gattuso Sep 26 '12 at 02:00
  • @JayGattuso I don't see why needing only the last one precludes a regex match, or indeed `rfind`. – Marcin Sep 26 '12 at 02:02
  • the preclusion is more me not knowing how to handle a multiple match scenario rather than an issue with the tools themselves... :) – Jay Gattuso Sep 26 '12 at 02:10
0

I think this makes use of the window() iterator function you mentioned.

>>> l = "ABCABACAAASSD"
>>> from itertools import islice
>>>
>>> def window(seq, n=2):
...     "Returns a sliding window (of width n) over data from the iterable"
...     "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
...     it = iter(seq)
...     result = tuple(islice(it, n))
...     if len(result) == n:
...         yield result
...     for elem in it:
...         result = result[1:] + (elem,)
...         yield result
...
>>>
>>> data = [c for c in l] # get each byte/charactor as separate item in list
>>> data
['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A', 'A', 'A', 'S', 'S', 'D']
>>> for idx, elements in enumerate(window(reversed(data), n=3)):
...     section = "".join(elements)
...     if section == "AAA":
...         print "found at {}!".format(idx)
...
found at 3!
>>>

To explain:

  • reversed() takes in a list and returns an iterator with the elements in reverse order
  • window() takes in an iterable object (list, tuple, iterator) and returns n number of elements, shifting the index 1 element at a time.
  • enumerate() takes an iterable and simply attaches a counter, so it will return the counter/position and the given element item.
monkut
  • 42,176
  • 24
  • 124
  • 155
  • Thank you, one day I hope to meet code like that and understand it!... I suspect at the root I am going about this process the wrong way. – Jay Gattuso Sep 26 '12 at 01:59