1

I have an array value that I need to get all possible combinations. Which using itertools.product does easily.

Eg. apple could be elppa, appel, lppae etc.

However the caveat is two fold.

I need to get all combinations of this word with the letters repeated 30 times. Eg. aaaaaaaaaaaaaaaaaaaaaaaaaapple , appleaaaaaaaaaaaaaaaaaaaaapple

Obviously we're working with a gigantic data source here, so when I run tests with eg 6-10 repeats it's reasonable fast (ie. under a minute). When running a test overnight to the power of 30, it suggests the test will take days to complete.

I have played with Numpy, which has commonly been suggested on StackOverflow as a faster/lighter method to use. But I have come unstuck on this, as all the variations I have found, have resulted in scripts killing my machine and using disk space, compared to the slow (too slow for this test), but more efficient itertools.product.

Also I am not understanding how you can pull all this data into a numty array to then calculate the following, without the overhead on the system.

Ultimately.

The point of the exercise is to count how many times the word apple appears in each row of results. But only when it appears once in a row. This would count: aaaaaaaaaaaaaaaaaaaaaaaaaapple This would not: appleaaaaaaaaaaaaaaaaaaaaapple

The below code works without too much strain on the machine, but runs too slowly.

Thanks

import itertools
import time
import numpy as np

apple = ['a','p','l','e']
occurences = 0
line = 0
arr_len = len(apple)
length = 30
squared = arr_len**length

start_time = time.time()

for string in itertools.imap(''.join, itertools.product(apple, repeat=length)):
    line += 1
    if (string.count('apple')==1):
        occurences += 1
        if occurences % 100000 == 0:
            print occurences, ("--- %s seconds ---" % (time.time() - start_time)),squared, line

print ('Occurences : ',occurences)
print ('Last line no. ',line)  
print ("--- %s seconds ---" % (time.time() - start_time))
wingman
  • 127
  • 2
  • 7
  • Does the `for` loop never ends? – gabra Dec 09 '15 at 16:33
  • replacing "string.count('apple')==1" with "'apple' in string " speeds your solution up a fair bit. – greg_data Dec 09 '15 at 16:37
  • Maybe this can help you http://stackoverflow.com/a/4714857/2029132 – gabra Dec 09 '15 at 16:39
  • I don't think this is feasible -- this approach has exponential runtime. Look: when `length = 10`, you "only" have `4**10 = 1,048,576` products to check. When `length = 30`, you have `4**30 = 1,152,921,504,606,846,976`. That's going to take a while no matter what optimizations you implement. – jme Dec 09 '15 at 16:47
  • Just an estimate: on my machine, `length = 10` completes in `0.6` seconds. So, roughly, each comparison takes `0.6 / (4**10)` seconds, or (about) half a microsecond. So the time to compute `4**30` operations is `4**30 * (5e-7) = 18,267 years`. – jme Dec 09 '15 at 16:49
  • @jme: It's totally feasible. See my answer. – Neil G Dec 09 '15 at 16:51
  • 1
    @NeilG Oh, I meant his *approach* isn't feasible. I believe the *solution* can be calculated with pencil and paper. – jme Dec 09 '15 at 16:52
  • 1
    My solution finds 28768047528652794 matches for (apple, 30) – Neil G Dec 09 '15 at 17:16
  • If the word is "papa" then is "papapa" a match or not? I.e., does it have 1 or 2 matches of papa? – Neil G Dec 09 '15 at 17:18

2 Answers2

1

The way you're trying to solve the problem is inherently exponential. You need to use dynamic programming. This problem has a polynomial time solution. If your word has n characters in it, you can use a Markov chain with 2n states.

import numpy as np

word = 'papal'
length = 10

word_chars = list(set(word))
n = len(word)
m = len(word_chars)
states = [0] * (2*n)
states[0] = 1
jumps = np.zeros((n, m), dtype=np.int)
for i in range(n):
    for j in range(m):
        # We've seen the first i characters of word, and we see character word_chars[j]
        if word[i] == word_chars[j]:
            value = i+1
        else:
            for k in range(i+1):
                if word[k: i] + word_chars[j] == word[:i - k + 1]:
                    value = i - k + 1
                    break
            else:
                value = 0
        jumps[i, j] = value

for i in range(length):
   new_states = [0] * (2*n)
    for j in range(n):
        for jump in jumps[j]:
            new_states[jump] += states[j]
            if n+jump < 2*n:
                new_states[n+jump] += states[n+j]
    states = new_states

print(np.sum(states[n:]))

If the word is "papa" then does "papapa" match or not? If not, you should delete states in the Markov chain.

Neil G
  • 32,138
  • 39
  • 156
  • 257
  • To satisfy the query, the example ("papa" > "papapa"), this would count as 1 match, even though there is a case to be made for 2 occurrences. – wingman Dec 09 '15 at 21:50
1

With a little thought we can apply some of the counting techniques from basic probability to calculate the number of sequences containing at most one occurrence of a word. The dynamic programming solution might be easier to come up with, though, and probably runs faster for small sequence size -- the solution below has linear time complexity in the sequence length, but isn't optimized for speed, and I just post it here to serve as a reference:

from scipy.misc import comb

def k(i, t, w):
    """Compute the number of occurrences.

    Arguments
    ---------
    i : int
        The number of products.
    w : int
        Length of the word.
    t : int
        The number of characters in the alphabet.

    """
    # consider all i - w + 1 ways of placing the word into `i` slots,
    # and subtract the number of sequences with multiple occurrences (n_x, n_y)
    r = i - w
    tot = 0
    for x in range(r + 1):
        y = r - x
        n_y = 0 if y < w else (y - w + 1) * t**(y - w)
        n_x = 0 if x < w else (x - w + 1) * t**(x - w)
        s = t**x * n_y + t**y * n_x
        tot += t**r - s

    # for i >= 15 we must compute a correction, because we are "double 
    # counting" some sequences with multiple occurrences. The correction
    # turns out to be an alternating sequence of binomial coefficients
    cor = 0
    for c_k in range(2, i // w):
        c_n = (c_k + 1) + i - w * (c_k + 1)
        cor += (-1)**c_k * int(comb(c_n, c_k)) * n(i - w * c_k)

    return tot + cor

Such that

>>> for i in range(31): print(i, k(i, 4, 5))

0 0
1 0
2 0
3 0
4 0
5 1
6 8
7 48
8 256
9 1280
10 6142
11 28648
12 130880
13 588544
14 2613760
15 11491331
16 50102320
17 216924640
18 933629696
19 3997722880
20 17041629180
21 72361164720
22 306190089280
23 1291609627904
24 5433306572800
25 22798585569285
26 95447339991160
27 398767643035280
28 1662849072252416
29 6921972555609600
30 28768047528652794
jme
  • 19,895
  • 6
  • 41
  • 39