How to do Byte Pair Encoding bigram counting and replacements efficiently in Python?

Question

In the Byte Pair Encoding algorithm, there's a replacement step where it changes the character strings delimited by spaces to bigrams.

I.e., given a list of str tuples as such:

[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]

And a string tuple: ('i', 's')

How do I process the list such that it iterates through all the tuple keys and and replace ('i', 's') with ('is')?, i.e. the output Counter will look something like this:

[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]

I've tried this:

>>> cin
[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
>>> [tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]

but is there a more efficient way than looping through each word, then changing them to string to do a replace and splitting them again and then casting them back into tuples?

Would regex replacement be faster? Is there a way to work with the list of tuples without dealing with strings?

I've tried this and it seems like replacing the string with str.replace is not the problem. It's really counting the bigrams and extracting them:

import io
from collections import Counter

import time

infile = 'big.txt' # comes from norvig.com/big.txt

n = 2
with io.open(infile, 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")
    for j in range(1,6400):
        unused_char = unichr(ord(u'\uE001') + j)

        start = time.time()
        char_bigrams = zip(*[text[i:] for i in range(n)])
        bigram_time = time.time() - start

        start = time.time()
        most_freq_bigram = Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]
        max_time = time.time() - start

        start = time.time()
        text = text.replace(''.join(most_freq_bigram), unused_char)
        replace_time = time.time() - start
        print j, ''.join(most_freq_bigram), most_freq_bigram, bigram_time, max_time, replace_time
    print text

This is tested on norvig.com/big.txt

[out]:

1 th (u't', u'h') 0.896255016327 3.28389787674 0.0253069400787
2 e (u'\ue002', u'e') 1.47053217888 3.16544914246 0.0280749797821
3 in (u'i', u'n') 1.13404297829 3.10529899597 0.0245559215546
4 an (u'a', u'n') 1.20013689995 3.63801002502 0.0242891311646
5 er (u'e', u'r') 1.41387891769 3.13376092911 0.0237591266632
6 on (u'o', u'n') 1.22826981544 3.06997895241 0.0227301120758
7 re (u'r', u'e') 1.21916294098 2.97599196434 0.0238041877747
8 at (u'a', u't') 1.14608097076 2.97988891602 0.0226521492004
9 en (u'e', u'n') 1.20747494698 2.88649988174 0.019054889679
10 ed (u'e', u'd') 1.16296696663 2.8995718956 0.0198271274567
11 is (u'i', u's') 1.17692494392 3.02292394638 0.0228500366211
12 d (u'\ue005', u'd') 1.13779211044 2.85169506073 0.0229239463806

I've already experimented with scikit-learn CountVectorizer and i didn't seem to be as fast as using zip, see Fast/Optimize N-gram implementations in python

Also, without them filter operation in the Counter step, it took even longer. The Counter operation is taking 3 seconds per iteration =(

How else can this operation be optimized?

Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]

That seems like an inconvenient data structure to use for this algorithm. Is there some reason you can't leave it as a string? — intrepidhero, Oct 30 '16 at 23:45
Using string is fine but keep the byte order of unicodes after the merge operations might be a little tricky. — alvas, Oct 31 '16 at 01:35
What is special about the unicode characters separating the words? Byte pair encoding is all about eliminating common byte pairs. The most common I see in your test string is \ue000. — intrepidhero, Oct 31 '16 at 20:08

score 5 · Answer 1 · edited Jun 20 '20 at 09:12

Your original code:

[tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]

I'll expand it so it's easier to see what's happening

result = []
qtuple = ("i", "s")
for i in cin:
    f = " ".join(qtuple)
    r = "".join(qtuple)
    word = ' '.join(i)
    word = word.replace(f, r)
    word = word.split()
    word = tuple(word)
    result.append(word)
print(result)

Look for things you can move outside of the loop. We can precompute the replacements instead of computing them for each word

find = " ".join(qtuple)
replacement = "".join(qtuple)
result = []
# this will join and split each word once
for i in cin:
    word = " ".join(i)
    # if you had multiple replacements to do, they should be in an inner loop here
    word = word.replace(find, replacement)
    result.append(tuple(word.split(" ")))
print(result)

Perhaps someone else can speak to the relatively efficiency of str.replace versus re.replace. Personally I tend to avoid regular expressions if a simple replace will do it, just for readability.

Further efficiency gains can be realized by changing the data structure for the input. If the replacement symbols are single characters then we can use a string instead of a list of tuples and avoid any joins inside the loop.

result = []
replacements = [("\ue000", "X"), ("is", "Z")]
s = "".join(["".join(t) for t in cin])
for f, r in replacements:
    s = s.replace(f,r)
print(s)

# output: thZXcorpusXinXtxtfileXtheXsentenceXbarXandXZXfooXfirstXaX.X

I think the question needs some requirements added to it to explain why the chosen data structure is advantageous. From an efficiency point of view, and in the context of the byte pair encoding algorithm, a string makes a lot more sense to me.

score 4 · Accepted Answer · answered Nov 01 '16 at 19:14

If you keep your string tuple to length 2 you could use reduce like this:

def cons_2(word_list, t):
    j = ''.join(t)
    f = lambda acc, e: acc[:-1] + (j,) if (acc[-1] == t[0] and e == t[1]) else acc + (e,)
    return [reduce(f, i[1:], (i[0],)) for i in word_list]

print cons_2(cin, ('i', 's'))

No replacing is involved, f is applied for every element i, the value of cin is not altered instead a new array is made and returned.

Details:

reduce applies f for every array element i and returns a value to the accumulator acc.
parameters of reduce:
- f: function to apply.
- i[1:]: the array to iterate with all the elements but the first.
- (i[0],): the initial value of the accumulator, it's a tuple with the first value of the input tuple i.
f: is a lambda function with the accumulator acc and the current element e as inputs:
- If the last element of the accumulator is equal to the first element of the string tuple and the current element e is equal to the second element of the string tuple, then return the tuple: acc[-1] + (j,) else continue with a normal concatenation: acc + (e,).

For string tuples > 2 the idea is the same but we have to manage the tuple's length l.

def cons_n(word_list, t):
    l = len(t)
    j = ''.join(t)
    f = lambda acc, e: acc[:-l] + (j, e,) if acc[-l:] == t or acc[:l] == t else acc + (e,)
    return [reduce(f, i[l:], (i[:l])) for i in word_list]

print cons_n(cin, ('i', 's'))

This should work with n-length string tuples.

Details:

Same process as above but using l: reduce applies f to the rest of the elements i[l:] and the initial value of the accumulator is a tuple with the first l elements: (i[:l]).
Check backward and forward for the l elements equal to the string tuple t, if true then add the tuple: acc[:-l] + (j, e,) else continue with normal concatenation: acc + (e,).

This is a functional approach no data is modified but generated, so should be safe to have multiple processes at the same time (in theory, I'm no expert about the Python interpreter).

If the code above is too weird for people not into functional programming this is another approach:

def cons_n_iter(tuple_list, tuple_seq):
     jnt = ''.join(tuple_seq)
     lnt = len(tuple_seq)
     res = []
     for word in tuple_list:
         acc = (word[:lnt])
         for letter in word[lnt:]:
             if acc[-lnt:] == tuple_seq or acc[:lnt] == tuple_seq:
                 acc = acc[:-lnt] + (jnt, letter,)
             else:
                 acc += (letter,)
         res += (acc,)
     return res

print cons_n_iter(cin, ('i', 's'))

The logic is the same as the functional approach, same use of the accumulator. In this case the res accumulator is explicit because in the examples above reduce was taking care of it.

score 3 · Answer 3 · answered Oct 31 '16 at 20:18

3

Is this what you need? using re.

import re,ast
cin = [('t','h',"i",'s', '\ue000'), ('c', 'i', 's', 'p')]
cin = re.sub(r"i'[,\s]+'s", r"is",str(cin))
cin = ast.literal_eval(cin)

answered Oct 31 '16 at 20:18

Shaowen

45
2
7

No. That still doesn't solve the problem that it needs to be converted back and forth as strings and list of tuples. – alvas Oct 31 '16 at 23:19

How to do Byte Pair Encoding bigram counting and replacements efficiently in Python?

3 Answers3

Your original code: