
I'm looking at the official TensorFlow example for Word2Vec. They created a dictionary for all the words, then a reverse dictionary, and the reverse dictionary is what the rest of the code mainly uses.

The line in question:

reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 

Full code block:

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
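To see concretely what each return value holds, here is the same function run on a toy word list (hypothetical data, small enough to inspect by hand):

```python
import collections

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reverse_dictionary

# Toy corpus (hypothetical), not the real training text.
words = ['the', 'quick', 'brown', 'the']
data, count, dictionary, reverse_dictionary = build_dataset(words)

print(data)                # [1, 2, 3, 1]
print(dictionary)          # {'UNK': 0, 'the': 1, 'quick': 2, 'brown': 3}
print(reverse_dictionary)  # {0: 'UNK', 1: 'the', 2: 'quick', 3: 'brown'}
```

So `data` is the corpus re-expressed as indices, `dictionary` maps word to index, and `reverse_dictionary` maps index back to word.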

Full official implementation.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb

This is the official implementation from TensorFlow, so there must be a good reason why they did this.

SantoshGupta7

1 Answer

To build the list data, the build_dataset() function needs a word-to-index mapping (dictionary).

For decoding model output later, the opposite index-to-word mapping is needed.

In Python, as in most languages, there is no built-in memory-efficient two-way (bijective) mapping. Therefore, the function creates and stores two dictionaries.
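A sketch of the two lookups with toy data (hypothetical words, not the real vocabulary):

```python
# Toy word -> index mapping (hypothetical data).
dictionary = {'UNK': 0, 'the': 1, 'fox': 2}
# Invert it: zip pairs each value with its key, dict() builds the inverse.
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))

# Encoding text uses the forward mapping...
encoded = [dictionary.get(w, 0) for w in ['the', 'fox', 'jumps']]
print(encoded)   # [1, 2, 0]  ('jumps' is out of vocabulary, falls back to UNK)

# ...while decoding indices back to words uses the reverse mapping.
decoded = [reverse_dictionary[i] for i in encoded]
print(decoded)   # ['the', 'fox', 'UNK']
```

Each direction is a constant-time dict lookup, which is why both are kept.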

Note that the logic can be written more simply with enumerate and a dictionary comprehension:

from operator import itemgetter

reverse_dictionary = dict(enumerate(map(itemgetter(0), count)))
dictionary = {v: k for k, v in reverse_dictionary.items()}
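For example, on a small count list in the same [word, count] format (toy data), this construction yields the same two mappings as the original loop plus zip inversion:

```python
from operator import itemgetter

# Toy count list in the format build_dataset produces (hypothetical data).
count = [['UNK', -1], ('the', 2), ('fox', 1)]

# enumerate assigns each word its index directly in count order...
reverse_dictionary = dict(enumerate(map(itemgetter(0), count)))
# ...and a comprehension inverts it into the forward mapping.
dictionary = {v: k for k, v in reverse_dictionary.items()}

print(dictionary)          # {'UNK': 0, 'the': 1, 'fox': 2}
print(reverse_dictionary)  # {0: 'UNK', 1: 'the', 2: 'fox'}
```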
jpp