26

I have a generator (a function that yields stuff), but when trying to pass it to gensim.Word2Vec I get the following error:

TypeError: You can't pass a generator as the sentences argument. Try an iterator.

Isn't a generator a kind of iterator? If not, how do I make an iterator from it?

Looking at the library code, it seems to simply iterate over sentences with something like `for x in enumerate(sentences)`, which works just fine with my generator. What is causing the error then?

BartoszKP
riv
  • 3
    Well ... They went to a lot of trouble to prevent you from using generators: https://github.com/piskvorky/gensim/blob/839513f81e3aa42f490331fa80a28d13b7b7026f/gensim/models/word2vec.py#L434 – mgilson Dec 08 '15 at 21:32
  • 3
    That makes no damn sense. – user2357112 Dec 08 '15 at 21:33
  • 2
    @user2357112 -- Perhaps the input needs to be iterated over multiple times. The docs say that a list is an OK input. (Of course, in that case `iterator` is definitely the _wrong_ term to put in the error message). – mgilson Dec 08 '15 at 21:34
  • @riv Then you can just change your generator to a list comprehension. – Tamas Hegedus Dec 08 '15 at 21:38
  • [I found the issue the check was supposed to address.](https://github.com/piskvorky/gensim/issues/319) It doesn't look like the people in the comment chain had a clear understanding of the vocabulary at the time. This error message should definitely be changed (and perhaps they should add `or iter(sentences) is iter(sentences)` to catch other iterator types). – user2357112 Dec 08 '15 at 21:39

4 Answers

15

A generator is exhausted after one loop over it. Word2vec needs to traverse the sentences multiple times (and probably to get the item at a given index, which is not possible for generators, since they behave like a stack from which you can only pop), so it requires something more solid, like a list.

In particular, their code calls two different functions, both of which iterate over sentences (so if you use a generator, the second one would run on an empty set):

self.build_vocab(sentences, trim_rule=trim_rule)
self.train(sentences)
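The effect of two consecutive passes can be reproduced without gensim. The following is a minimal sketch; `sentence_stream` and `count_sentences` are hypothetical stand-ins for the user's generator and for any function that makes one full pass over the data, not gensim code.

```python
def sentence_stream():
    """Hypothetical stand-in for a generator of tokenized sentences."""
    yield ["hello", "world"]
    yield ["generators", "run", "once"]


def count_sentences(sentences):
    """Stand-in for any function that makes one full pass over the data."""
    return sum(1 for _ in sentences)


gen = sentence_stream()
first_pass = count_sentences(gen)   # consumes the generator
second_pass = count_sentences(gen)  # same object, now exhausted

as_list = list(sentence_stream())   # a list can be traversed repeatedly
list_first = count_sentences(as_list)
list_second = count_sentences(as_list)

print(first_pass, second_pass)  # 2 0
print(list_first, list_second)  # 2 2
```

The second pass over the generator sees nothing, which is what a training step would encounter after a vocabulary-building step had already consumed the data.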

It should work with anything that implements __iter__ and is not a GeneratorType. So wrap your function in an iterable interface and make sure that you can traverse it multiple times, meaning that

sentences = your_code
for s in sentences:
  print(s)
for s in sentences:
  print(s)

prints your collection twice.
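One way to get such an iterable interface is a small class whose `__iter__` calls the generator function again on every pass. This is only a sketch: `RestartableSentences` and `my_sentences` are hypothetical names, and whether a given gensim version accepts the object depends on its input check.

```python
class RestartableSentences:
    """Iterable that calls a generator function afresh on every pass."""

    def __init__(self, generator_function, *args, **kwargs):
        self._generator_function = generator_function
        self._args = args
        self._kwargs = kwargs

    def __iter__(self):
        # A new generator object is created for each loop, so every
        # pass starts from the beginning of the data.
        return self._generator_function(*self._args, **self._kwargs)


def my_sentences():
    """Hypothetical generator function producing tokenized sentences."""
    yield ["first", "sentence"]
    yield ["second", "sentence"]


sentences = RestartableSentences(my_sentences)
pass_one = list(sentences)
pass_two = list(sentences)  # same content again, not an empty list
```

Note that the class stores the generator *function*, not a generator object. Returning one stored generator object from `__iter__` would silence the type check but leave the second pass empty.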

lejlot
  • 3
    All iterators are exhausted after one loop, not only those created by generators. (Continuing to raise `StopIteration` on subsequent calls to `next` after exhaustion is a requirement, in fact.) The error message probably meant to say **iterable** instead of **iterator**. Some iterables can be looped over many times, as correctly explained in your answer. – user4815162342 Dec 08 '15 at 21:43
  • I see, but wouldn't an iterator only be iterable once if it returns self in `__iter__`? I'm also thinking that the error message meant to say iterable. – riv Dec 08 '15 at 21:44
  • 1
    @riv An iterator **has** to return `self` in `__iter__`. An *iterable* doesn't, however, it can (and does in case of built-in containers) return a new iterator that starts at the beginning. – user4815162342 Dec 08 '15 at 21:45
  • I just made a class that returns my generator in the `__iter__` method (and has no other methods) and it worked. – riv Dec 08 '15 at 21:55
  • 1
    @riv It may *look* like it's working because you're no longer getting an exception, but is it working correctly? If both `build_vocab` and `train` iterate over the `sentences` iterator, `train` will encounter an empty iterator. It's very likely the code doesn't train at all. The proper fix is explained by Alex Volkov, except it can be spelled more shortly as `list(generator_obj)`. – user4815162342 Dec 08 '15 at 22:27
  • Interesting [comment from Guido](https://github.com/python/mypy/issues/4707#issuecomment-487091762) around this that suggests there was never a guarantee that Iterables could be re-used and the error is with functions accepting Iterable then assuming they could iterate multiple times. – Philip Couling Mar 30 '23 at 19:59
8

As previous posters have mentioned, a generator can be looped over like any other iterable, but with two significant limitations compared with a list: it gets exhausted after one pass, and you can't index it.

I quickly looked up the documentation, on this page -- https://radimrehurek.com/gensim/models/word2vec.html

The documentation states that

gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1, negative=0, cbow_mean=0, hashfxn=, iter=1, null_word=0, trim_rule=None, sorted_vocab=1) ...

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

I'd venture to guess that the logic inside the function inherently requires one or more list properties, such as item indexing; there might be an explicit assert statement or if statement that raises an error.

A simple hack that can solve your problem is turning your generator into a list with a list comprehension. Your program will pay a CPU performance penalty and use more memory, but this should at least make the code work.

my_iterator = [x for x in generator_obj]
Alex Volkov
  • Note that the list created by the comprehension in your answer is an iterable (has an `__iter__` method), but not an iterator (has no `next` method). – user4815162342 Dec 08 '15 at 21:47
  • Yes you're right, it's list comprehension, not iterator, I'll change my answer. – Alex Volkov Dec 08 '15 at 21:49
  • But it means that you can't train word2vec on a very large corpus. Gensim library, however, prides itself on being memory-efficient. – Sergey Orshanskiy Feb 03 '17 at 00:25
  • @osa This was a quick solution for this particular problem. You will likely need to write something more involved for your case i.e. implement your own iterator that could be cycled multiple times, where you can trade off memory use for I/O, by re-reading a file several times, see -- sample iterator http://stackoverflow.com/questions/19151/build-a-basic-python-iterator; see itertools.cycle -- https://docs.python.org/2/library/itertools.html#itertools.cycle – Alex Volkov Feb 03 '17 at 20:15
4

Other answers have pointed out that Gensim requires two passes to build the Word2Vec model: one to build the vocabulary (self.build_vocab), and a second to train the model (self.train). You can still pass a generator to the train method (e.g., if you're streaming data) by calling build_vocab and train separately, with a fresh generator for each.

from gensim.models import Word2Vec

model = Word2Vec()
sentences = my_generator()  # first pass
model.build_vocab(sentences)

sentences = my_generator()  # second pass of same data
model.train(sentences,
            total_examples=num_sentences,  # total number of documents to process
            epochs=model.epochs)
David C
3

It seems gensim throws a misleading error message.

Gensim wants to iterate over your data multiple times. Most libraries just build a list from the input, so the user doesn't have to care about supplying a sequence that can be iterated multiple times. Of course, building an in-memory list can be very resource-consuming, whereas iterating over a file, for example, can be done without storing the whole file in memory.
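The file case can be sketched as an iterable that reopens the file on every pass. `FileSentences` is a hypothetical name, and the format (one sentence per line, tokens separated by whitespace) is an assumption made for illustration.

```python
import os
import tempfile


class FileSentences:
    """Iterable over a text file: one whitespace-tokenized sentence per line."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Reopen the file on every pass; lines are read one at a time,
        # so the whole corpus is never held in memory.
        with open(self.path, encoding=self.encoding) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Small demonstration with a temporary file (assumed layout: one sentence per line).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the quick brown fox\n\njumps over the lazy dog\n")
    corpus_path = tmp.name

corpus = FileSentences(corpus_path)
first = list(corpus)
second = list(corpus)  # the file is read again from the start
os.remove(corpus_path)
```

This trades memory for extra I/O: each pass rereads the file from disk instead of keeping every sentence in a list.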

In your case, just changing the generator to a list comprehension should solve the problem.

Tamas Hegedus