
I want to generate n-grams from a sequence of tokens:

bigram: "1 3 4 5" --> { (1,3), (3,4), (4,5) }

After searching, I found a thread that used:

def find_ngrams(input_list, n):
  # Offset the sequence n times and zip the slices into n-tuples.
  return zip(*[input_list[i:] for i in range(n)])
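For the sequence in the example above, this reproduces the expected bigrams:

list(find_ngrams([1, 3, 4, 5], 2))
# [(1, 3), (3, 4), (4, 5)]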

If I use this piece of code at training time, I think it will kill performance, so I am looking for a better option.


1 Answer


If you need to generate bigrams in string format:

import tensorflow as tf

tf.enable_eager_execution()  # TF 1.x; eager execution is the default in TF 2.x

sentence = ['this is example sentence']
tokens = tf.string_split(sentence).values
bigrams = tokens[:-1] + ' ' + tokens[1:]  # join each token with its successor

# tf.Tensor([b'this is' b'is example' b'example sentence'], shape=(3,), dtype=string)
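If you are on TF 2.x, where eager execution is on by default and tf.string_split was replaced by tf.strings.split, a minimal equivalent sketch of the same trick:

import tensorflow as tf  # TF 2.x

tokens = tf.strings.split('this is example sentence')  # 1-D string tensor
bigrams = tf.strings.join([tokens[:-1], tokens[1:]], separator=' ')
# tf.Tensor([b'this is' b'is example' b'example sentence'], shape=(3,), dtype=string)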

You can also use tensorflow-transform to generate n-grams.

import tensorflow_transform as tft

# `tokens` is a SparseTensor of string tokens (e.g. from tf.string_split);
# (1, 2) is the inclusive range of n-gram sizes to produce.
tft.ngrams(tokens, (1, 2), ' ')

Note: tensorflow-transform supported only Python 2 until 22 January 2019.
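A minimal end-to-end sketch, assuming TF 1.x eager execution as above; tft.ngrams takes a SparseTensor of tokens, and the exact ordering of the output values may vary:

import tensorflow as tf
import tensorflow_transform as tft

tf.enable_eager_execution()

tokens = tf.string_split(['this is example sentence'])  # SparseTensor of tokens
ngrams = tft.ngrams(tokens, (1, 2), ' ')
# ngrams.values holds both unigrams and bigrams, e.g.
# b'this', b'this is', b'is', b'is example', b'example', b'example sentence', b'sentence'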

    The added bonus w/ these tf-transform ops is that they are driven by core graph ops, so they work outside of python! At least w/ my small experiment w/ `ngrams`... – eggie5 Oct 21 '19 at 02:49