
I have a problem with this splitting function. The function takes a string, for example `word = 'optimization'`, chooses its splitting points based on randomly generated numbers, and turns those splits into bigrams. The `'0'` marker means end-of-word. Consider the word below; the left side is the input, and the function should return one of all possible splittings of that word, each with equal probability:

'optimization' = [['op', 'ti'], ['ti', 'mizati'], ['mizati', 'on'], ['on', '0']]

Problem: When I profiled all my functions, this splitting function consumed the most runtime (it processes 100k words), but I'm stuck on optimizing it. I need some help at this point. There may also be better approaches, but I'm limited by my own perspective.

from numpy import mod
import nltk


def random_Bigramsplitter(word):
    spw = []
    length = len(word)
    rand = random_int(word)  # produces a random number based on len(word) (not shown here)

    if rand == length:  # case of not dividing at all
        return [tuple([word, '0'])]
    else:
        div = mod(rand, (length + 1))  # choose a division point via the mod operation
        bound = length - div
        spw.append(div)
        while div != 0:  # keep splitting the remaining part until a zero-length piece is drawn
            rand = random_int(word)
            div = mod(rand, (bound + 1))
            bound = bound - div
            spw.append(div)
        result = spw

    # turn the list of segment lengths into cumulative splitting points
    b = 0
    points = []
    for x in range(len(result) - 1):
        b += result[x]
        points.append(b)

    # slice the word at the splitting points
    xy = 0
    t = []
    for i in points:
        t.append(word[xy:i])
        xy = i

    if word[xy:len(word)] != '':  # append the remaining tail, if any
        t.append(word[xy:len(word)])

    t.extend('0')  # end-of-word marker
    c = [b for b in nltk.bigrams(t)]

    return c
  • For me it's not very clear whether the [NLTK bigram maker](https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383) is fast enough for your needs. Maybe it's better to write your own bigram splitter. – colidyre Oct 02 '15 at 12:51
  • @LetzerWille: I assume a function like this: `from random import choice; def random_int(word): return choice(range(len(word))) + 1` – colidyre Oct 02 '15 at 12:52

2 Answers


You can replace

c = [b for b in nltk.bigrams(t)]

with

def get_ngram(word, n):
    return zip(*[word[i:] for i in xrange(n)])

c = [b for b in get_ngram(t, 2)]

This seems to be faster; I do not claim that it is the fastest solution.

There are more answers about optimizing bigram speed. This seems to be a good starting point: Fast n-gram calculation. My code snippet is from: http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
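
For example, applied to a token list like the one from the question, a quick check looks like this (a minimal sketch; it assumes Python 3, so `range` replaces `xrange`, and the parameter is named `seq` here only to emphasize that it works on any sequence):

def get_ngram(seq, n):
    # zip n progressively shifted views of the sequence to build n-grams
    return zip(*[seq[i:] for i in range(n)])

t = ['op', 'ti', 'mizati', 'on', '0']
print(list(get_ngram(t, 2)))
# [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]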

colidyre
  • @Serkan Kumyol - have you made any progress with that? The `c=[b for b in nltk.bigrams(t)]` line was a time killer, as already shown by LetzerWille. – colidyre Oct 04 '15 at 18:35
  • I'm rewriting the whole code; it may be better to divide the string according to ASCII correspondence. I also tried line_profiler: my random_int should definitely be changed, and maybe I can get rid of the while loop, because it's the most costly expression, as I just learned. – serkan maestro Oct 05 '15 at 07:14
  • @Serkan Kumyol - Then please either edit your question according to your new code, or upvote helpful answers, or even mark one as the solution that worked for you, to indicate that you have at least made progress. If you edit your question, I would also recommend providing the code for `random_int()`. – colidyre Oct 05 '15 at 08:18
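
The comments above mention getting rid of the while loop. A minimal sketch of one such direction, assuming the goal stated in the question (every possible splitting of the word equally likely): include each of the len(word) - 1 internal boundaries independently with probability 1/2 and pair the resulting pieces directly, without nltk.bigrams. This is only an illustration of the idea, not the asker's actual rewrite:

import random

def random_bigram_splitter(word):
    # pick each internal boundary with probability 1/2, so every one of the
    # 2**(len(word) - 1) possible splittings is equally likely
    points = [i for i in range(1, len(word)) if random.random() < 0.5]
    # slice the word between consecutive split points (0 and len(word) included)
    pieces = [word[a:b] for a, b in zip([0] + points, points + [len(word)])]
    pieces.append('0')  # end-of-word marker
    return list(zip(pieces, pieces[1:]))  # bigrams without nltk

random_bigram_splitter('optimization')
# e.g. [('op', 'ti'), ('ti', 'mizati'), ('mizati', 'on'), ('on', '0')]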
Here is a profile of your function with line_profiler and 1000 runs. I do not see what could be optimized significantly; it seems that the repeated `random_int()` calls and the `c=[b for b in nltk.bigrams(t)]` line take the biggest share of the time:


Function: random_Bigramsplitter at line 10

Line       Hits         Time  Per Hit   % Time  Line Contents
==============================================================

10                                           def random_Bigramsplitter(word):
11                                           
12      1000        30037     30.0      1.2      spw=[]
13      1000        30107     30.1      1.2      length=len(word)
14      1000       165263    165.3      6.8      rand=random_int(word)
15      1000        31969     32.0      1.3      if rand==length:
16        87         4361     50.1      0.2          return [tuple([word,'0'])]
17                                           
18                                               else:
19       913        84378     92.4      3.5          div=mod(rand,(length+1))
20       913        37940     41.6      1.6          bound=length-div
21       913        33985     37.2      1.4          spw.append(div)
22      3316       122185     36.8      5.0          while div!=0:
23      2403       379946    158.1     15.7              rand=random_int(word)
24      2403       201558     83.9      8.3              div=mod(rand,(bound+1))
25      2403        81872     34.1      3.4              bound=bound-div
26      2403        86954     36.2      3.6              spw.append(div)
27       913        27342     29.9      1.1          result=spw
28       913        26054     28.5      1.1      b=0
29       913        27139     29.7      1.1      points=[]
30      3316       118538     35.7      4.9      for x in range(len(result)-1):
31      2403        90216     37.5      3.7          b=b+result[x]
32      2403        82404     34.3      3.4          points.append(b)
33       913        26537     29.1      1.1      xy=0
34       913        26754     29.3      1.1      t=[]
35      3316       101717     30.7      4.2      for i in points:
36      2403        95277     39.6      3.9          t.append(word[xy:i])
37      2403        71435     29.7      2.9          xy=i
38       913        35714     39.1      1.5      if word[xy:len(word)]!='':
39       467        18571     39.8      0.8          t.append(word[xy:len(word)])
40       913        36953     40.5      1.5      t.extend('0')
41       913       323568    354.4     13.3      c=[b for b in nltk.bigrams(t)]
42       913        27741     30.4      1.1      return c
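
For reference, a line-by-line profile like the one above can be produced with line_profiler roughly as follows (a sketch; the script name and the word list are placeholders, not taken from the question):

# profile_split.py (hypothetical file name)
# `profile` is injected into the namespace by kernprof when run with -l
@profile
def random_Bigramsplitter(word):
    ...  # body as in the question

if __name__ == '__main__':
    for word in words:  # `words` stands in for the asker's 100k-word list
        random_Bigramsplitter(word)

# then, from the shell:
#   kernprof -l -v profile_split.py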
LetzerWille