I have an array of values and need to generate all possible combinations of them, which itertools.product does easily.
Eg. apple could be elppa, appel, lppae etc.
However, the caveat is twofold.
I need all combinations of this word's letters with the letters repeated 30 times (repeat=30 in itertools.product). E.g. aaaaaaaaaaaaaaaaaaaaaaaaaapple, appleaaaaaaaaaaaaaaaaaaaaapple
Obviously we're working with a gigantic data set here, so when I run tests with e.g. 6-10 repeats it's reasonably fast (i.e. under a minute). When I ran a test overnight with 30 repeats, the progress suggested the test would take days to complete.
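To put the scale in numbers, here is a rough back-of-the-envelope sketch (Python 3). The throughput figure is a made-up placeholder, not a measurement from my machine, and `search_space` and `projected_seconds` are hypothetical helper names.

```python
def search_space(length, alphabet_size=4):
    """Number of strings itertools.product yields for a given repeat."""
    return alphabet_size ** length


def projected_seconds(length, strings_per_second, alphabet_size=4):
    """Naive run-time estimate: total strings divided by an assumed rate."""
    return search_space(length, alphabet_size) / strings_per_second


# ASSUMPTION: placeholder throughput, replace with a rate you actually measure.
ASSUMED_RATE = 1_000_000  # strings checked per second

for n in (6, 10, 30):
    print(n, search_space(n), projected_seconds(n, ASSUMED_RATE))
```

Going from 10 to 30 repeats multiplies the number of strings by 4**20, roughly 1.1e12, so whatever run time is measured at 10 repeats grows by about that factor.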
I have played with NumPy, which is commonly suggested on Stack Overflow as a faster/lighter method to use. But I have come unstuck on this, as every variation I have found has resulted in scripts that kill my machine and eat disk space, compared to the slow (too slow for this test) but more memory-efficient itertools.product.
I also don't understand how you can pull all this data into a NumPy array and then calculate the following, without the overhead on the system.
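A rough sketch (Python 3) of the difference in memory behaviour: an array holding every row needs one slot per character of every row, whereas the itertools.product iterator only produces one row at a time. This assumes one byte per character, and `bytes_needed` is a hypothetical helper name.

```python
import itertools


def bytes_needed(length, alphabet_size=4, bytes_per_char=1):
    """Bytes required to hold every row at once (ASSUMES 1 byte per character)."""
    return alphabet_size ** length * length * bytes_per_char


print(bytes_needed(10))  # all rows for 10 repeats
print(bytes_needed(30))  # all rows for 30 repeats

# The iterator is lazy: taking the first few rows does not build the rest.
rows = map(''.join, itertools.product('aple', repeat=30))
first_rows = list(itertools.islice(rows, 3))
print(first_rows)
```

The lazy version never holds more than the current row, which would explain why it is easy on the machine even though it is slow.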
Ultimately, the point of the exercise is to count the rows of results in which the word apple appears, but only when it appears exactly once in the row.
This would count: aaaaaaaaaaaaaaaaaaaaaaaaaapple
This would not: appleaaaaaaaaaaaaaaaaaaaaapple
The code below works without too much strain on the machine, but runs too slowly.
Thanks
# Python 2 (uses itertools.imap and the print statement)
import itertools
import time
import numpy as np

apple = ['a','p','l','e']
occurences = 0
line = 0
arr_len = len(apple)
length = 30
squared = arr_len**length  # total number of strings that will be generated
start_time = time.time()

# Lazily build every string of the given length and count those
# that contain 'apple' exactly once.
for string in itertools.imap(''.join, itertools.product(apple, repeat=length)):
    line += 1
    if (string.count('apple')==1):
        occurences += 1
        # Progress report every 100000 matches
        if occurences % 100000 == 0:
            print occurences, ("--- %s seconds ---" % (time.time() - start_time)),squared, line

print ('Occurences : ',occurences)
print ('Last line no. ',line)
print ("--- %s seconds ---" % (time.time() - start_time))
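This is not what the code above does, but since only the final total is needed, one possible direction is to count the matching rows without generating them, by walking a small pattern-matching automaton one character position at a time. The following is a minimal Python 3 sketch under that assumption; `build_transitions` and `count_exactly_once` are hypothetical names. It counts every occurrence, which matches `str.count` here because 'apple' cannot overlap itself.

```python
def build_transitions(pattern, alphabet):
    """table[k][c] = how many leading characters of `pattern` are matched
    after reading c, given that k characters were matched before."""
    m = len(pattern)
    table = []
    for k in range(m + 1):
        row = {}
        for c in alphabet:
            s = pattern[:k] + c
            j = min(len(s), m)
            # Shrink until the last j characters of s equal the first j of pattern.
            while j > 0 and s[-j:] != pattern[:j]:
                j -= 1
            row[c] = j
        table.append(row)
    return table


def count_exactly_once(pattern, alphabet, length):
    """Number of strings of `length` over `alphabet` containing `pattern` exactly once."""
    m = len(pattern)
    table = build_transitions(pattern, alphabet)
    # counts[(matched, hits)] = number of prefixes that end in this state
    counts = {(0, 0): 1}
    for _ in range(length):
        nxt = {}
        for (matched, hits), ways in counts.items():
            for c in alphabet:
                j = table[matched][c]
                h = hits + (1 if j == m else 0)
                if h > 1:
                    continue  # a second occurrence disqualifies the row
                nxt[(j, h)] = nxt.get((j, h), 0) + ways
        counts = nxt
    return sum(ways for (_, hits), ways in counts.items() if hits == 1)


print(count_exactly_once('apple', 'aple', 6))
print(count_exactly_once('apple', 'aple', 30))
```

The work grows with the length times a handful of states rather than with 4**length, so the result for small lengths (6-10) can be checked against the brute-force loop before trusting it at 30.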