
I have a tuple of zeros and ones, for instance:

(1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)

It turns out:

(1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1) == (1, 0, 1, 1) * 3

I want a function f such that if s is a non-empty tuple of zeros and ones, f(s) is the shortest subtuple r such that s == r * n for some positive integer n.

So for instance,

f( (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1) ) == (1, 0, 1, 1)

What is a slick way to write the function f in Python?

Edit:

The naive method I am currently using

def f(s):
    for i in range(1, len(s)):
        if len(s) % i == 0 and s == s[:i] * (len(s) / i):
            return s[:i]
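For reference, a lightly tightened variant of the naive method (a sketch: it skips lengths that don't divide `len(s)`, stops at half the length, and returns `s` itself for aperiodic input, where the original falls through to `None`):

```python
def f(s):
    n = len(s)
    # only lengths dividing n can tile s; any length past n // 2
    # that works would have to be n itself
    for i in range(1, n // 2 + 1):
        if n % i == 0 and s == s[:i] * (n // i):
            return s[:i]
    return s  # aperiodic: the shortest repeating block is s itself
```

For example, `f((1, 0, 1, 1) * 3)` gives `(1, 0, 1, 1)`, and an aperiodic tuple like `(1, 0, 0)` is returned unchanged.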
math4tots
    What's wrong with your current approach? Seems pretty compact and relatively efficient to me. – David Robinson Mar 11 '13 at 06:10
  • @DavidRobinson Nothing wrong per se. But in my program `s` can get fairly large, so I would prefer something a little more efficient, especially if I can get a little more efficiency without coding too hard. – math4tots Mar 11 '13 at 06:14
  • What makes you think your approach isn't efficient? In any approach, you'll have to compare all the values in the tuple at some point, and tuple multiplication is very fast (certainly faster than doing a for loop and checking subtuples). The only optimization I can see is changing `range(1, len(s))` to `range(1, len(s) / 2)`. – David Robinson Mar 11 '13 at 06:16
  • @DavidRobinson I wasn't sure if there was any easy way to beat O(n^2). And is tuple multiplication really better than checking subtuples? I was also afraid that when `s` gets really large tuple multiplication would build large tuples, even when the comparison fails early on. – math4tots Mar 11 '13 at 06:28
  • You should be able to use suffix trees to do it in O(n) (or O(n log n)) time, but I would not call it "slick". What is your definition of slick? I find your naive method slick... – Knoothe Mar 11 '13 at 06:30
  • @Knoothe For me 'slick' is when I don't have to work too hard to get code to do what it's supposed to. Although, often I make an exception to this rule for really cool one line solution that I can just copy and paste. – math4tots Mar 11 '13 at 06:30
  • btw, how large can s get? – Knoothe Mar 11 '13 at 06:32
  • @Knoothe So far, it hasn't gotten over `1000000`. So I'm probably overexaggerating on the size of `s`. However, sometimes this function is called millions of times (not always with large input), so efficiency here may be important. – math4tots Mar 11 '13 at 06:39
  • Well, a quadratic algorithm of (million)^2, will likely kill the perf... very interesting problem. – Knoothe Mar 11 '13 at 06:46
  • @Knoothe I guess it is not a quadratic algorithm because N has, on average, log(N) divisors. The right complexity is between O(N*logN) and O(N^2). I guess `max([number_of_divisors(x) for x in xrange(2, 1000000)])` is less than 250. – msbrogli Mar 11 '13 at 06:58
  • Incidentally, turning the tuple into a string (using `"".join(map(str, s))`) actually speeds it up *far* beyond many of the below solutions: strings take up less memory than tuples of integers and are easier to manipulate in certain situations. However, shx2's solution still beats that approach, due mainly to the overhead of constructing the string. – David Robinson Mar 11 '13 at 06:58
  • @msbrogli: Yes, on an average, it probably isn't. You are right. – Knoothe Mar 11 '13 at 07:12
  • @math4tots: I believe I have a fast solution (provably linear, no more than 2 + fraction passes over the string). Hope it works for you. I have edited my answer to put in the code which will return the prefix. (I have it as a list, but a tuple should also work). – Knoothe Mar 11 '13 at 07:38

7 Answers


I believe I have an O(n) time solution (actually 2n+r, where n is the length of the tuple and r is the length of the sub-tuple) which does not use suffix trees, but uses a string matching algorithm (like KMP, which you should find off-the-shelf).

We use the following little known theorem:

If x,y are strings over some alphabet,

then xy = yx if and only if x = z^k and y = z^l for some string z and integers k,l.

I now claim that, for the purposes of our problem, this means that all we need to do is determine if the given tuple/list (or string) is a cyclic shift of itself!

To determine if a string is a cyclic shift of itself, we concatenate it with itself (it does not even have to be a real concat, just a virtual one will do) and check for a substring match (with itself).

For a proof, suppose the string is a (nontrivial) cyclic shift of itself.

Then we have that the given string y = uv = vu. Since uv = vu, we must have u = z^k and v = z^l, and hence y = z^{k+l}, by the above theorem. The other direction is easy to prove.
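The theorem (due to Lyndon and Schützenberger) can be sanity-checked directly on tuples; a toy illustration, not a proof:

```python
# x and y commute under concatenation exactly when both are
# powers of a common block z
z = (1, 0, 1, 1)
x, y = z * 2, z * 3
assert x + y == y + x              # the two tuples commute...
assert x == z * 2 and y == z * 3   # ...and both are powers of the same z

# a non-commuting pair, for contrast
u, v = (1, 0), (1, 1)
assert u + v != v + u
```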

Here is the python code. The method is called powercheck.

def powercheck(lst):
    count = 0
    position = 0
    # the second occurrence of lst in lst+lst starts at the shortest
    # period, which is the length of the repeating block
    for pos in KnuthMorrisPratt(double(lst), lst):
        count += 1
        position = pos
        if count == 2:
            break

    return lst[:position]


def double(lst):
    # virtual concatenation: yield the elements of lst twice over
    for _ in range(2):
        for elem in lst:
            yield elem

def main():
    print powercheck([1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1])

if __name__ == "__main__":
    main()

And here is the KMP code which I used (due to David Eppstein).

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

For your sample this outputs

[1,0,1,1]

as expected.

I compared this against shx2's code (not the numpy one), by generating a random 50-bit string, then replicating it to make the total length 1 million. This was the output (the decimal numbers are outputs of time.time()):

1362988461.75

(50, [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1])

1362988465.96

50 [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1]

1362988487.14

The above method took ~4 seconds, while shx2's method took ~21 seconds!

Here was the timing code. (shx2's method was called powercheck2).

def rand_bitstring(n):
    rand = random.SystemRandom()
    lst = []
    for j in range(1, n+1):
        r = rand.randint(1,2)
        if r == 2:
            lst.append(0)
        else:
            lst.append(1)

    return lst

def main():
    lst = rand_bitstring(50)*200000
    print time.time()
    print powercheck(lst)
    print time.time()
    powercheck2(lst)
    print time.time()
Knoothe
  • That is a pretty cool approach. Practically it also seems to be one of the faster methods. However, I do have `numpy` available, and it looks as though `shx2`'s `numpy` solution works best for my use case. I don't think converting the KMP approach to use numpy will really gain much for my application. – math4tots Mar 11 '13 at 07:59
  • @math4tots: let me try comparing the numpy solution with this. Alas, I don't seem to have numpy installed with me right now... perhaps later. – Knoothe Mar 11 '13 at 08:00
  • @math4tots: the numpy solution is worst case quadratic too I believe... Care to elaborate on your use case? (I am going by the million number...) btw, have you actually tried comparing? I would like to, but I need to install numpy. Perhaps someone else would be kind enough to compare the numpy one with this... – Knoothe Mar 11 '13 at 08:04
  • My program calculates periodic solutions (a sequence of zeros and ones) that satisfy a specific criteria where solutions are equivalent up to multiplication. At this point, this part of the program doesn't really seem like the bottleneck, so I'm a little reluctant to push further until I have gotten the rest of my program up to par. – math4tots Mar 11 '13 at 08:14
  • @math4tots: Yeah, was just curious. You know best what works for you :-) (And I was also curious for my own sake how this compares to the numpy solution, and was not trying to convince you or anything). Thank you for the very interesting problem. – Knoothe Mar 11 '13 at 08:17
  • @Knoothe, yes my numpy solution is quadratic. Its advantage is that it uses bitwise operation and (numpy-optimized) c-space loops, which makes each iteration super fast. In practice, this often makes it perform better than theoretically superior solution (which yours is, without a doubt). – shx2 Mar 11 '13 at 08:30
  • @shx2: Yeah, but in this case I believe the theoretical solution is practical too (unlike other theoretical solutions :-)), and that is why I was curious for the comparison. Of course, only math4tots really knows which works best for him. – Knoothe Mar 11 '13 at 19:37

The following solution is O(N^2), but has the advantage of not creating any copies (or slices) of your data, as it is based on iterators.

Depending on the size of your input, the fact you avoid making copies of the data can result in a significant speed-up, but of course, it would not scale as well for huge inputs as algorithms with lower complexity (e.g. O(N*logN)).

[This is the second revision of my solution, the first one is given below. This one is simpler to understand, and is more along the lines of OP's tuple-multiplication, only using iterators.]

from itertools import izip, chain, tee

def iter_eq(seq1, seq2):
    """ assumes the sequences have the same len """
    return all( v1 == v2 for v1, v2 in izip(seq1, seq2) )

def dup_seq(seq, n):
    """ returns an iterator which is seq chained to itself n times """
    return chain(*tee(seq, n))

def is_reps(arr, slice_size):
    if len(arr) % slice_size != 0:
        return False
    num_slices = len(arr) / slice_size
    return iter_eq(arr, dup_seq(arr[:slice_size], num_slices))

s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
for i in range(1,len(s)):
    if is_reps(s, i):
        print i, s[:i]
        break
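For anyone on Python 3, roughly the same idea ports directly (`izip` becomes the built-in `zip`, `/` becomes `//`); a sketch:

```python
from itertools import chain, tee

def iter_eq(seq1, seq2):
    """assumes the sequences have the same length"""
    return all(v1 == v2 for v1, v2 in zip(seq1, seq2))

def dup_seq(seq, n):
    """an iterator over seq chained to itself n times"""
    return chain(*tee(seq, n))

def is_reps(arr, slice_size):
    # True iff arr is its first slice_size elements repeated
    if len(arr) % slice_size != 0:
        return False
    num_slices = len(arr) // slice_size
    return iter_eq(arr, dup_seq(arr[:slice_size], num_slices))
```

Used the same way as above: loop `i` from 1 upward and stop at the first `i` for which `is_reps(s, i)` holds.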

[My original solution]

from itertools import islice

def is_reps(arr, num_slices):
    if len(arr) % num_slices != 0:
        return False
    slice_size = len(arr) / num_slices
    for i in xrange(slice_size):
        if len(set( islice(arr, i, None, num_slices) )) > 1:
            return False
    return True

s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
for i in range(1,len(s)):
    if is_reps(s, i):
        print i, s[:i]
        break

You can avoid the call to set() by using something like:

def is_iter_unique(seq):
    """ a faster version of testing len(set(seq)) <= 1 """
    seen = set()
    for x in seq:
        seen.add(x)
        if len(seen) > 1:
            return False
    return True

and replacing this line:

if len(set( islice(arr, i, None, num_slices) )) > 1:

with:

if not is_iter_unique(islice(arr, i, None, num_slices)):
shx2

Simplifying Knoothe's solution. His algorithm is right, but his implementation is too complex. This implementation is also O(n).

Since your array is only composed of ones and zeros, what I do is use the existing `str.find` implementation (a fast Boyer–Moore-style search) to implement Knoothe's idea. It's surprisingly simpler and amazingly faster at runtime.

def f(s):
    s2 = ''.join(map(str, s))
    return s[:(s2+s2).index(s2, 1)]
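Restated standalone with a couple of checks (the aperiodic case falls out naturally, since the only occurrence of `s2` in `s2+s2` past index 0 is then at `len(s2)`):

```python
def f(s):
    # s2+s2 contains s2 starting at every multiple of its shortest
    # period; the first occurrence past index 0 is that period
    s2 = ''.join(map(str, s))
    return s[:(s2 + s2).index(s2, 1)]
```

For example, `f((1, 0, 1, 1) * 3)` gives `(1, 0, 1, 1)`, while `f((1, 0, 0))` returns the whole tuple.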
Juan Lopes
  • Awesome, didn't know this fact about str.find. Now this is what I call slick :-) +10 if I could. – Knoothe Mar 11 '13 at 21:49
  • @math4tots: Did you get a chance to see this one? If you are willing to accept my answer, then this should be the one to accept (I don't mind, if that is what prompted you). – Knoothe Mar 12 '13 at 16:09

Here's another solution (competing with my earlier iterators-based solution), leveraging numpy.

It does make a (single) copy of your data, but taking advantage of the fact your values are 0s and 1s, it is super-fast, thanks to numpy's magics.

import numpy as np

def is_reps(arr, slice_size):
    if len(arr) % slice_size != 0:
        return False
    arr = arr.reshape((-1, slice_size))
    return (arr.all(axis=0) | (~arr).all(axis=0)).all()

s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1) * 1000
a = np.array(s, dtype=bool)
for i in range(1,len(s)):
    if is_reps(a, i):
        print i, s[:i]
        break
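The same slicing check, restated standalone for a quick sanity test (each column of the reshaped array must be constant, i.e. all True or all False):

```python
import numpy as np

def is_reps(arr, slice_size):
    # True iff the boolean array arr is its first slice_size
    # elements repeated
    if len(arr) % slice_size != 0:
        return False
    arr = arr.reshape((-1, slice_size))
    # a column is constant across slices exactly when it is
    # all True or all False
    return bool((arr.all(axis=0) | (~arr).all(axis=0)).all())

a = np.array((1, 0, 1, 1) * 3, dtype=bool)
assert is_reps(a, 4)
assert not is_reps(a, 6)
```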
shx2

Just a different approach to the problem.

I first determine all the factors of the length, then split the list and check whether all the parts are the same.

>>> from itertools import izip
>>> def f(s):
    def factors(n):
        #http://stackoverflow.com/a/6800214/977038
        return set(reduce(list.__add__,
                ([i, n//i] for i in range(2, int(n**0.5) + 1) if n % i == 0)))
    _len = len(s)
    # try candidate slice lengths in increasing order (a set is
    # unordered), so the first hit is the shortest repeating block;
    # include 1 for the all-equal case
    for fact in sorted(factors(_len) | {1}):
        compare_set = set(izip(*[iter(s)]*fact))
        if len(compare_set) == 1:
            return compare_set


>>> s = (1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1)
>>> f(s)
set([(1, 0, 1, 1)])
Abhijit
  • This doesn't achieve the desired result at all- it checks only the first two repetitions of the subvector. `f((1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1))` would incorrectly return `(1, 1, 0, 0)`. – David Robinson Mar 11 '13 at 06:23
  • @DavidRobinson: Rewrote my answer. :-) – Abhijit Mar 11 '13 at 06:54

You can achieve it in sublinear time by XOR'ing the rotated binary form of the input array:

  1. get the binary representation of the array, input_binary
  2. loop from i = 1 to len(input_array)/2, and for each loop, rotate input_binary to the right by i bits, save it as rotated_bin, then compare the XOR of rotated_bin and input_binary.
  3. The first i that yields 0 is the length of the desired substring.

Complete code:

def get_substring(arr):
    binary = ''.join(map(str, arr)) # join the elements to get the binary form

    for i in xrange(1, len(arr) // 2 + 1):
        # do an i-bit right rotation, giving the bit string rotated_bin
        rotated_bin = binary[-i:] + binary[:-i]
        # parse as base-2 so leading zeros are handled correctly
        if int(rotated_bin, 2) ^ int(binary, 2) == 0:
            return arr[0:i]

    return None


if __name__ == "__main__":
    test = [1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
    print get_substring(test) # [1,0,1,1]
K Z
  • I am not sure how you can claim sub-linear. Python will switch to big integers at some point, won't it? – Knoothe Mar 11 '13 at 07:11
  • @Knoothe You only need to compare at most len(array)/2 times, and hence sublinear. You can use [https://pypi.python.org/pypi/bitarray/](https://pypi.python.org/pypi/bitarray/) if that's your concern. – K Z Mar 11 '13 at 07:15
  • len(array)/2 is still linear by definition. – Knoothe Mar 11 '13 at 07:17
  • @Knoothe I understand exactly where you are coming from -- what I'm trying to emphasize here is that it depends more on the **length of the recurrent subsequence**, rather than the length of the original sequence. And in the worst case it is strictly < n time (yes, in big-O notation it is still `O(n)` in the worst case). – K Z Mar 11 '13 at 07:20
  • Aren't you also assuming the arithmetic is O(1)? Which won't be, if we start using big integers? – Knoothe Mar 11 '13 at 07:34
  • @Knoothe `XOR` is considered a `O(1)` operation algorithmically, as it is hardware supported; but in reality yes, it depends on the size of register and memory. – K Z Mar 11 '13 at 07:44
  • That is not my point. My point is, if you have an integer with million bits, you cannot possible claim to have hardware support for arithmetic operations on it. – Knoothe Mar 11 '13 at 08:05
  • @Knoothe Agreed -- we don't yet, but we **can** easily parallelize XOR across a cluster and retain O(1) speed. That is, if we want to discuss theoretical running time regardless of hardware, it is still a `O(1)` operation imo. – K Z Mar 11 '13 at 08:10
  • You are now making assumptions about what resources OP has? Even so, I am curious, how are you going to parallelize your whole algorithm? – Knoothe Mar 11 '13 at 08:11
  • @Knoothe I wasn't -- you were. I just stated that `XOR` is regarded as a `O(1)` operation if we disregard hardware. And I stated clearly that in reality this is not always possible. – K Z Mar 11 '13 at 08:13
  • Not sure what I said has caused this misunderstanding. XOR is regarded O(1) only if the ints fit in O(1) registers/words! That is not the case here (see the comments by OP to the question). Anyway, seems like we are going around in circles. I will stop wasting your time. Thanks. – Knoothe Mar 11 '13 at 08:15
  • @Knoothe I agreed -- see my above comments. And I do appreciate reading your solution, I've learn things from it :) As for parallelizing XOR, this is just one of the papers I have in mind: http://www.cs.toronto.edu/~norouzi/research/papers/multi_index_hashing.pdf – K Z Mar 11 '13 at 08:17
  • Thanks! Parallelizing XOR is fine, but that isn't the only aspect of your algorithm that needs to parallelized, is it? – Knoothe Mar 11 '13 at 08:18

This one is just a dumb recursive comparison in Haskell. It takes about one second for Knoothe's million-long string (f a). Cool problem! I'll think about it some more.

a = concat $ replicate 20000 
    [1,1,1,0,0,1,0,1,0,0,1,0,0,1,1,1,0,0,
     0,0,0,0,1,1,1,1,0,0,0,1,1,0,1,1,1,1,
     1,1,1,0,0,1,1,1,0,0,0,0,0,1]

f s = 
  f' s [] where
    f' [] result = []
    f' (x:xs) result =
      let y = result ++ [x]   
      in if concat (replicate (div (length s) (length y)) y) == s
            then y
            else f' xs y
גלעד ברקן