
I am trying to read a .bin file. It has millions of lines, each consisting of a word followed by space-separated numbers.

So far in Python I haven't been able to get a single line printed; I get either gibberish or the wrong output.

with open('GoogleNews-vectors-negative300.bin', mode='rb') as file: # b is important -> binary
    for line in file.readline():
        print(line)

How should I read a binary file line by line?

Rafael

1 Answer


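Binary files are generally not line-oriented, and their raw bytes look like gibberish when printed. So your code is working as written, but your expectations are wrong. (In Python 3, for line in file.readline() iterates over the individual bytes of the first chunk read, so each "line" is an integer, not a line of text.)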

What's your ultimate goal? If it's to have usable word-vectors, you probably want to use some pre-existing Word2Vec library, such as gensim in Python.
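As a rough sketch of the library route, assuming gensim is installed and the file from the question sits in the working directory, loading could look like the following. The function name load_vectors and the default limit value are illustrative choices, not something from the question; KeyedVectors.load_word2vec_format() with binary=True and limit is gensim's public API.

```python
def load_vectors(path="GoogleNews-vectors-negative300.bin", limit=100_000):
    """Load word vectors from a word2vec-format binary file via gensim.

    `limit` caps how many vectors are read (an illustrative default),
    which keeps memory use down; pass None to load the whole file.
    """
    # Imported here so the function can be defined even where gensim
    # is not installed; calling it does require gensim.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)
```

After kv = load_vectors(), something like kv["king"] should return that word's vector as a NumPy array and kv.most_similar("king", topn=5) its nearest neighbours, provided the word is among the vectors that were loaded.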

In such a library, you can also view the source-code for reading the .bin word-vectors format, as a model to learn from, if for some reason you really do need to write your own reading code. For example, here's the gensim source code that reads word-vector files in the format written by the original word2vec.c code from Google:

https://github.com/RaRe-Technologies/gensim/blob/3c3506d51a2caf6b890de3b1b32a8b85f7566ca5/gensim/models/utils_any2vec.py#L123

(It's more often used from the KeyedVectors.load_word2vec_format() public API method.)
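If you really do want to read the file yourself, here is a minimal sketch using only the standard library. It assumes the layout written by word2vec.c: a text header line "<vocab_size> <vector_size>", then for each entry the word's bytes, a single space, vector_size 4-byte floats, and usually a newline. Little-endian floats and UTF-8 words are assumptions, and the function name read_word2vec_bin is made up for illustration; this is not gensim's implementation.

```python
import struct


def read_word2vec_bin(path, limit=None):
    """Yield (word, vector) pairs from a word2vec.c-style binary file."""
    with open(path, "rb") as f:
        # The header is the only real text "line": b"<vocab_size> <vector_size>\n"
        vocab_size, vector_size = (int(x) for x in f.readline().split())
        # Assumption: vectors are stored as little-endian 32-bit floats.
        record = struct.Struct("<%df" % vector_size)
        count = vocab_size if limit is None else min(limit, vocab_size)

        for _ in range(count):
            # The word is everything up to the next space. Newlines that
            # separate one record from the next are skipped.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if not ch:
                    raise EOFError("file ended while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":
                    chars += ch
            word = chars.decode("utf-8", errors="replace")

            # The vector has a fixed size, so read exactly that many bytes
            # rather than searching for a line ending inside float data.
            data = f.read(record.size)
            if len(data) != record.size:
                raise EOFError("file ended while reading a vector")
            yield word, record.unpack(data)
```

Iterating with for word, vec in read_word2vec_bin(path, limit=5) would then print-friendly pairs instead of raw bytes. Reading a fixed number of bytes per vector matters because the float data can itself contain newline or space bytes, which is why readline() gives misleading results on this format.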

gojomo