Simplifying ngram loops to compress the string given a fix set of ngrams

Question

Given in list of characters, list('Hello▁world▁') and a list of character tuples, i.e.

[('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

The objective is to iterate through the tuples and collapse the list of characters if they match. I've tried this and it works:

import copy

def matcher(s, ngram):
  while s:
    window = tuple(s[:2]) # Since the string tuples are in pairs.
    if window == ngram:
      yield "".join(window)
      s = s[2:]
    else:
      yield s[0]
      s = s[1:]

def combine_ngrams(s, vocab):
  prev = copy.copy(s)
  while True:
    for v in vocab:
      s = list(matcher(s, v))
    if s == prev:
      break
    else:
      prev = s
  return s


vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), 
         ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

s = list('Hello▁world▁')

combine_ngrams(s, vocab)

[out]:

['Hello▁', 'world▁']

But the multiple while loops in both the outer function combine_ngrams() and inner matcher() looks like something that can be easily simplified.

Or maybe the operations doesn't need to loop through the tuples and maybe some regex methods to iteratively apply the vocab substitution would work. Is there a way to simply the nested while loops in the combine_ngrams function?

Here's more input/output examples:

[in]:

s = list('abcde'); vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('abcde'); vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'),  ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('aaab'); vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]

s = list('Hello▁ポケモンセンター▁world▁'); vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

[out]:

['ab', 'c', 'd', 'e']

['abcde']

['aa', 'a', 'b']

['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁', 'world▁']

P/S: For anyone interested this is related to the byte-pair encoding algorithm and if there's a more algorithmic rather than pythonic loop way to solve this problem, please do suggest.

yes that is correct because the list of tuples in vocab is kinda sorted with some rankings implicitly. So the output expected is `['ab', 'c', 'd', 'e']` — alvas, Apr 25 '23 at 12:39
Using `vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]; s=list('abcde')` will return `['abcde']` — alvas, Apr 25 '23 at 12:46
Correct `vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]; s = list('aaab')` will return `['aa', 'a', 'b']` — alvas, Apr 25 '23 at 12:47
Yes, in that case, it should be character by itself, e.g. with same vocab in question and `s = list('Hello▁ポケモンセンター▁world▁')` , it returns `['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁', 'world▁'] ` — alvas, Apr 25 '23 at 12:53
No I mean are there any characters that are guaranteed to not appear in `s`? So that we can use one as a delimiter to join `s` into a single string (useful for regex). — Kelly Bundy, Apr 25 '23 at 12:58
Unfortunately, there's no guarantee to what cannot appear in `s`, since the normal basic byte-pair algorithm considers any bytes possible. — alvas, Apr 25 '23 at 12:59
Maybe if you want to you can try using `\xa0` if you want to use a character to control/use regex. — alvas, Apr 25 '23 at 13:00
Not really, but I think we can make some assumption here and "pre-clean" the input to get rid of `\xa0` before the byte-pair process. — alvas, Apr 25 '23 at 13:09
Just to confirm, what is your goal? I wasn't sure if it was simplicity, readability, or speed. I mean, I also see some complexity in your code, but I would like to know more clearly defined goals. — ken, May 03 '23 at 11:50
Speed and simplicity. Now it's a little too spaghetti like, though it works pretty fast. — alvas, May 03 '23 at 13:36

ken · Answer 1 · 2023-05-04T13:01:57.760

Instead of regular expression, you can use plain text string replacement.

This approach requires a delimiter that is never contained in the input. Also, a delimiter should not contain the same sequence repeatedly for str.split to work.

I have not come up with a general algorithm for choosing the appropriate delimiter. However, for practical purposes, you could define a string (sequence) that cannot be used for input, using Unicode Private Use Area. Note that you do not have to forbid the use of some characters, only the use of a particular string.

def combine_ngrams2(s, vocab):
    sep = "\uE000"
    ss = "".join(s)
    while sep in ss:
        sep += chr(0xE000 + len(sep))
    assert sep not in ss and len(set(sep)) == len(sep)

    prev = None
    current = f"{sep}{sep.join(s)}{sep}"
    while prev != current:
        prev = current
        for v1, v2 in vocab:
            current = current.replace(f"{sep}{v1}{sep}{v2}{sep}", f"{sep}{v1}{v2}{sep}")

    return current[len(sep) : -len(sep)].split(sep)

This solution may seem like a dumb implementation, but str.replace is so well optimized that it would be hard to find a better alternative.

score 0 · Answer 2 · answered May 05 '23 at 08:33

Yes, there is a way to simplify the nested while loops in the combine_ngrams() function. One approach would be to use a regular expression to find the longest match in the input string for each tuple in the vocabulary. This would eliminate the need for the matcher() function and the inner while loop.

import re

def combine_ngrams(s, vocab):
    pattern = '|'.join(['(?:' + re.escape(a) + ')' for a, b in vocab]) # create regex pattern
    while True:
        prev = ''.join(s)
        s = re.sub(pattern, lambda m: vocab[m.lastindex - 1][1], prev) # apply substitution
        if s == prev:
            break
        s = list(s)
    return s

here's what happens when we put the input to the function

s = list('Hello▁world▁')
vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'),          ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
print(combine_ngrams(s, vocab)) # output: ['Hello▁', 'world▁']

s = list('abcde')
vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
print(combine_ngrams(s, vocab)) # output: ['e']

s = list('abcde')
vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'),  ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
print(combine_ngrams(s, vocab)) # output: ['e']

s = list('aaab')
vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]
print(combine_ngrams(s, vocab)) # output: ['b']

s = list('Hello▁ポケモンセンター▁world▁')
vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
print(combine_ngrams(s, vocab)) # output: ['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁world▁']

Hope this helped you to accomplish this using regex patterns...

Simplifying ngram loops to compress the string given a fix set of ngrams

2 Answers2