break compound based on dictionary

Question

I have a grammar which says that 'A' can be replaced with `'A','aa','aA','Aa','AA'. (Sanskrit Grammar to be precise).

I want to split a compound word into its possible components, e.g. 'samADAna' -> ['sam+ADAna','sama+ADAna'].

lstrep = [('A',('A','aa','aA','Aa','AA'))]

My dictionary sample is

['sam','sama','ADAna']

The actual dictionary is 450,000 words list.

Optionally replacing a substring python has shown a way to create a list of all possible permutations after replacing the 'A' at all places.

As can be seen, it would give a 25 member list. After this I use Generic Human's code at How to split text without spaces into list of words? to infer the break in the compound based on my dictionary.

Practically the code has to run 25 times. It is not a big problem at this juncture.

But if my input string was 'samADAnApA' - the permutations would be 625. Code would have to iter for 625 times. It is a heavy cost on memory and time.

Question - Is there a way by which I can restrict the possible permutations to the words allowable by the dictionary. e.g. the dictionary doesn't have 'samA'.

Therefore samADAna, samAaDAna, samAADAna etc would not be included in permutations?

My try:

if __name__=="__main__":
    perm = permut(sys.argv[1],lstrep,words) # function permut creates all possible permutations of replacements.
    output = []
    for mem in perm:
        split = infer_spaces(mem) # Code of Generic Human 
        if split is not False:
            output.append(split)
    output = sorted(output,key=len)
    print output

score 0 · Answer 1 · answered Dec 07 '15 at 07:09

I think that you are trying to implement the divide rules from joining in Sanskrit. (sandhi-vichchhed)

Now, as far as I remember, there are only a set of rules (and we are not talking samas here, only sandhi). Like here, you are talking of:

a/A + a/A = A      //pronounce as if this is hindi/sanskrit

So, I really don't think you need to split words in all ways possible. Pick up the LHS of your splitting equation, eg. You are splitting the words at A and the new words are such that last character of new word is a/A and 1st of next is a/A. (Or maybe in this case, last character of 1st word is also not a, due to (small a) sound). Now you need to see the dictionary and see the possible combinations. This can be really ambiguous as I'm not really sure how will you distinguish between, lets say, sam and sama in this case.

So, you actually don't need to split the words in all possible ways, only where sandhi -rules apply, and the split words have their existence in the dictionary.

Like here,

word = samAdhAna
possible splits:
    sam + adhana      --
    sama + adhana     --
    samA + adhana
    sam + Adhana
    sama + Adhana
    samA + Adhana
    samadh + ana
    samadha + ana
    samadhA + ana
    samadh + Ana
    samadha + Ana
    samadhA + Ana

Now, the splits marked with -- are the ones where both split-words are present in the dictionary. So, in this example, you actually need to consider 12 cases in total.

break compound based on dictionary

1 Answers1