
I have downloaded a pre-trained word2vec model for my native language. The archive contained a "news.model.bin" file; when I unzipped it I expected a .txt file or a pickle, but instead I found another .bin file whose contents look like chaos:

\09\b9\.,-;sdfkf %some really strange symbols and seem to be invalid symbols%

I can't even copy it: the file is so heavy that opening it normally makes my laptop die. The question is: can a file that looks like this be a pre-trained model? If yes, how am I supposed to work with it?

P.S. The link where I got the model from (models are at the bottom of the page): http://ling.go.mail.ru/dsm/ru/about

  • A quick google turned up [this](http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/). I think it's a specialized format for word2vec. Hope it helps. – Kh40tiK Nov 26 '16 at 14:39
  • Possible duplicate of [Convert word2vec bin file to text](http://stackoverflow.com/questions/27324292/convert-word2vec-bin-file-to-text) – Franck Dernoncourt Nov 26 '16 at 16:16

1 Answer


Two solutions:

  1. Convert the .bin to .txt: Convert word2vec bin file to text
  2. Directly read the .bin as shown below.
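
For option 1, the conversion itself only needs the standard library. Below is a minimal sketch (the function name is mine, and it assumes the usual word2vec binary layout: an ASCII header line `<vocab_size> <vector_size>`, then each token as raw bytes terminated by a single space, followed by `vector_size` little-endian float32 values and a newline):

```python
import struct

def word2vec_bin_to_text(bin_path, txt_path):
    """Convert a word2vec binary file to the plain-text format.

    Assumes the standard layout produced by the original word2vec tool:
    an ASCII header "<vocab_size> <vector_size>\n", then per word the
    token bytes, one space, and vector_size little-endian float32 values.
    """
    with open(bin_path, "rb") as fin, open(txt_path, "w", encoding="utf-8") as fout:
        vocab_size, vector_size = map(int, fin.readline().split())
        fout.write("{} {}\n".format(vocab_size, vector_size))
        for _ in range(vocab_size):
            # Read the token byte by byte until the separating space;
            # newlines between records are skipped.
            chars = []
            while True:
                ch = fin.read(1)
                if ch == b" ":
                    break
                if ch != b"\n":
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8")
            vec = struct.unpack("<{}f".format(vector_size),
                                fin.read(4 * vector_size))
            fout.write(word + " " + " ".join("%.6f" % v for v in vec) + "\n")
```

Applied to `news.model.bin`, this would produce a text file whose first line repeats the header and whose remaining lines each hold one word followed by its vector.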

https://gist.github.com/j314erre/b7c97580a660ead82022625ff7a644d8 contains some code to read the .bin and load it into a TensorFlow variable:

    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    if FLAGS.word2vec:
        # Start from a random uniform matrix; rows for words present in
        # the word2vec file are overwritten below.
        initW = np.random.uniform(-0.25, 0.25,
                                  (len(vocab_processor.vocabulary_), FLAGS.embedding_dim))
        # Load any vectors from the word2vec binary file
        print("Load word2vec file {}\n".format(FLAGS.word2vec))
        with open(FLAGS.word2vec, "rb") as f:
            header = f.readline()
            vocab_size, layer1_size = map(int, header.split())
            binary_len = np.dtype('float32').itemsize * layer1_size
            for _ in range(vocab_size):
                # The token is stored as raw bytes terminated by a space;
                # newlines between records are skipped.
                word = []
                while True:
                    ch = f.read(1)
                    if ch == b' ':
                        word = b''.join(word).decode('utf-8')
                        break
                    if ch != b'\n':
                        word.append(ch)
                idx = vocab_processor.vocabulary_.get(word)
                if idx != 0:
                    initW[idx] = np.frombuffer(f.read(binary_len), dtype='float32')
                else:
                    f.read(binary_len)

        sess.run(cnn.W.assign(initW))

You can use this code in this TensorFlow text classification example.
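
As a quick sanity check that a mystery `.bin` really is in this format, you can read just the header before loading the whole file. A small sketch (the helper name is my own):

```python
def read_word2vec_header(path):
    """Return (vocab_size, vector_size) from a word2vec binary file.

    A valid file starts with an ASCII header line "<vocab_size> <vector_size>";
    anything else raises ValueError instead of loading gigabytes of noise.
    """
    with open(path, "rb") as f:
        header = f.readline(100)  # the header line is short; cap the read
        parts = header.split()
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            raise ValueError("does not look like a word2vec binary: %r" % header[:40])
        return int(parts[0]), int(parts[1])
```

If this returns something plausible (say, a few hundred thousand words with 300-dimensional vectors), the "strange symbols" are just the packed float32 vectors, which is expected.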

