I have a problem with this splitting function. The function takes a string, for example word = 'optimization', picks its splitting points based on randomly generated numbers, and turns the resulting splits into bigrams. The '0' marker means end-of-word. Consider the word below; the left side is the input, and the function should return any one of the word's possible outputs with equal probability:
'optimization' = [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]
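To make the mapping concrete, here is a minimal sketch (separate from the function below) that turns one fixed set of split indices into the bigrams shown above; the indices [2, 4, 10] are hand-picked to reproduce this example, whereas the real function draws them at random:

word = 'optimization'
points = [2, 4, 10]  # hypothetical split indices chosen to reproduce the example
pieces = [word[i:j] for i, j in zip([0] + points, points + [len(word)])]
pieces.append('0')  # end-of-word marker
print(list(zip(pieces, pieces[1:])))
# [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]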
Problem: when I profiled all my functions, this splitting function consumed the most runtime (it processes 100k words), but I'm stuck on optimizing it. I need some help at this point. There may also be better approaches, but I'm limited by my own perspective.
from numpy import mod
import nltk
def random_Bigramsplitter(word):
    spw = []
    length = len(word)
    rand = random_int(word)  # random_int (defined elsewhere) produces a random number with respect to len(word)
    if rand == length:  # probability of not dividing at all
        return [(word, '0')]
    div = mod(rand, (length + 1))  # defining division points via the mod operation
    bound = length - div
    spw.append(div)
    while div != 0:  # keep drawing piece lengths until a zero-length draw ends the loop
        rand = random_int(word)
        div = mod(rand, (bound + 1))
        bound = bound - div
        spw.append(div)
    points = []
    b = 0
    for x in range(len(spw) - 1):  # turn piece lengths into cumulative split indices
        b += spw[x]
        points.append(b)
    t = []
    xy = 0
    for i in points:  # slice the word at each split index
        t.append(word[xy:i])
        xy = i
    if word[xy:] != '':  # keep the trailing piece, if any
        t.append(word[xy:])
    t.append('0')  # end-of-word marker
    return list(nltk.bigrams(t))
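On the "better ways" point: below is a sketch of the kind of simpler variant I have in mind, using the standard library's random.randint instead of numpy's mod (scalar numpy calls carry noticeable per-call overhead) and zip instead of nltk.bigrams. I have not verified that it reproduces the exact split distribution of my random_int helper, so treat it as a sketch under that assumption, not a drop-in replacement:

import random

def random_bigram_splitter(word):
    # Sketch: draw each piece length uniformly from the remaining budget,
    # then pair adjacent pieces. Assumes a uniform draw is an acceptable
    # stand-in for random_int; the split distribution may differ from above.
    pieces = []
    start = 0
    remaining = len(word)
    while remaining > 0:
        size = random.randint(1, remaining)  # length of the next piece
        pieces.append(word[start:start + size])
        start += size
        remaining -= size
    pieces.append('0')  # end-of-word marker
    return list(zip(pieces, pieces[1:]))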