I have a grammar (Sanskrit grammar, to be precise) which says that 'A' can be replaced with 'A', 'aa', 'aA', 'Aa' or 'AA'.
I want to split a compound word into its possible components, e.g. 'samADAna' -> ['sam+ADAna','sama+ADAna']. The replacement rule is stored as:
lstrep = [('A',('A','aa','aA','Aa','AA'))]
A sample of my dictionary is:
['sam','sama','ADAna']
The actual dictionary is a list of 450,000 words.
The question "Optionally replacing a substring python" shows a way to create a list of all possible permutations after replacing the 'A' at every position.
As can be seen, 'samADAna' contains two occurrences of 'A', so this gives a 25-member list (5 * 5). After this I use Generic Human's code from "How to split text without spaces into list of words?" to infer the breaks in the compound based on my dictionary.
In practice that code has to run 25 times, which is not a big problem at this stage.
But if my input string were 'samADAnApA', with four occurrences of 'A', there would be 625 permutations (5 * 5 * 5 * 5), and the code would have to iterate 625 times. That is a heavy cost in memory and time.
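To make the growth concrete, the brute-force expansion can be sketched roughly as below. This is only a minimal sketch: `expand_all` is a hypothetical stand-in written with `itertools.product`, not the actual `permut` function from the linked question, and it ignores the dictionary entirely.

```python
from itertools import product

lstrep = [('A', ('A', 'aa', 'aA', 'Aa', 'AA'))]


def expand_all(text, replacements):
    """Return every string obtained by replacing each character that has
    a rule with one of its variants (hypothetical helper, no pruning)."""
    options = dict(replacements)
    # Characters without a rule keep exactly one choice: themselves.
    choices = [options.get(ch, (ch,)) for ch in text]
    return [''.join(parts) for parts in product(*choices)]


print(len(expand_all('samADAna', lstrep)))    # 25
print(len(expand_all('samADAnApA', lstrep)))  # 625
```

Every additional 'A' multiplies the number of candidates by five, and each candidate then goes through the word-splitting step separately.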
Question: Is there a way to restrict the possible permutations to the words allowed by the dictionary? For example, the dictionary doesn't have 'samA', so samADAna, samAaDAna, samAADAna etc. would not be included in the permutations.
My try:
import sys

if __name__ == "__main__":
    # permut creates all possible permutations of replacements.
    perm = permut(sys.argv[1], lstrep, words)
    output = []
    for mem in perm:
        split = infer_spaces(mem)  # Generic Human's code
        if split is not False:
            output.append(split)
    output = sorted(output, key=len)
    print(output)
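To illustrate the kind of restriction I am asking about, here is a rough sketch that expands and splits in a single pass and abandons a branch as soon as the word being built is not a prefix of any dictionary word. Everything in it is an assumption made for illustration: `build_prefixes` and `pruned_splits` are hypothetical names, the prefix set is a simple stand-in for a trie, and it does not rank the splits the way Generic Human's code does.

```python
def build_prefixes(words):
    """Collect every non-empty prefix of every dictionary word."""
    prefixes = set()
    for word in words:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return prefixes


def pruned_splits(text, replacements, words):
    """Return all 'w1+w2+...' splits of text under the replacement rules,
    dropping any branch whose current word is not a dictionary prefix."""
    word_set = set(words)
    prefixes = build_prefixes(words)
    options = dict(replacements)
    results = set()

    def walk(pos, current, done):
        # current: word being built; done: words already completed.
        if pos == len(text):
            if current in word_set:
                results.add('+'.join(done + [current]))
            return
        for variant in options.get(text[pos], (text[pos],)):
            extend(pos + 1, current, done, variant)

    def extend(next_pos, current, done, pending):
        # Feed a variant in one character at a time, because a word
        # boundary may fall inside it (e.g. 'aA' -> 'sama' + 'A...').
        if not pending:
            walk(next_pos, current, done)
            return
        candidate = current + pending[0]
        if candidate not in prefixes:
            return  # no dictionary word starts like this: prune
        rest = pending[1:]
        extend(next_pos, candidate, done, rest)  # keep building the word
        if candidate in word_set and (rest or next_pos < len(text)):
            extend(next_pos, '', done + [candidate], rest)  # cut here

    walk(0, '', [])
    return sorted(results)


lstrep = [('A', ('A', 'aa', 'aA', 'Aa', 'AA'))]
words = ['sam', 'sama', 'ADAna']
print(pruned_splits('samADAna', lstrep, words))
# ['sam+ADAna', 'sama+ADAna']
```

With the sample dictionary, the variants that would produce 'samA...' are discarded at the first 'A', so the 25 full strings are never materialised. For a 450,000-word dictionary the prefix set would be large, and a trie would probably be the more economical structure.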