How to Train GloVe algorithm on my own corpus

Question

I tried to follow this.
But somehow I wasted a lot of time and ended up with nothing useful.
I just want to train a GloVe model on my own corpus (a ~900 MB corpus.txt file). I downloaded the files provided in the link above and compiled them using Cygwin, after editing the demo.sh file and changing it to VOCAB_FILE=corpus.txt (should I leave CORPUS=text8 unchanged?). The output was:

cooccurrence.bin
cooccurrence.shuf.bin
text8
corpus.txt
vectors.txt

How can I use those files to load a GloVe model in Python?

score 20 · Answer 1 · edited Sep 10 '21 at 02:19

You can do it using GloVe library:

Install it: pip install glove_python

Then:

from glove import Corpus, Glove

# `lines` must be an iterable of tokenized sentences (lists of words)
# Create a corpus object
corpus = Corpus()

# Fit the corpus to build the co-occurrence matrix that GloVe trains on
corpus.fit(lines, window=10)

# Train the embeddings on the co-occurrence matrix, then save the model
glove = Glove(no_components=5, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')
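The snippet above uses a variable named `lines` without defining it. As a minimal sketch, assuming the corpus is a plain text file with one sentence or document per line and whitespace-separated words, it could be built like this (the helper name `read_tokenized_lines` is hypothetical, not part of glove_python):

```python
def read_tokenized_lines(path):
    """Return a list of token lists, one per non-empty line of the file."""
    lines = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            # split() with no argument splits on any run of spaces or tabs
            tokens = raw.split()
            if tokens:
                lines.append(tokens)
    return lines
```

Usage would then be `lines = read_tokenized_lines("corpus.txt")` before calling `corpus.fit(lines, window=10)`. Note that this keeps the whole tokenized corpus in memory, which may be a concern for a file of several hundred megabytes.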

Reference: word vectorization using glove

edited Sep 10 '21 at 02:19 · Mehdi Abbassi
answered Jun 25 '19 at 07:36 · Minions

Had to use pip install glove==1.0.0 on Windows 10 and on Linux Mint 19.3. All sorts of errors trying to install glove_python – Thom Ives Feb 20 '20 at 23:38
The library is difficult to work with since it no longer seems to be maintained. I found this approach, which uses the Word2Vec algorithm (a slightly different approach, however). But note that there is a mistake. Here is the solution: `model.wv.intersect_word2vec_format(pretrained_path, binary=False, lockf=1.0)` – David Beauchemin Oct 26 '22 at 20:05
Also, see this fix for the `IndexError` https://github.com/RaRe-Technologies/gensim/issues/3094. – David Beauchemin Oct 26 '22 at 20:06

Palak Bansal · Answer 2 · 2021-01-28T10:01:01.143

This is how you run the model

$ git clone http://github.com/stanfordnlp/glove
$ cd glove && make

To train it on your own corpus, you only have to change one file: demo.sh.

Remove the block from if to fi that follows 'make'. Replace the CORPUS value with your file name, 'corpus.txt'. There is another if block at the end of 'demo.sh':

if [ "$CORPUS" = 'text8' ]; then

Replace text8 with your file name.

Run demo.sh once the changes are made.

$ ./demo.sh

Make sure your corpus file is in the correct format. You'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by newline characters.
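As a rough illustration of that format, here is a minimal Python sketch that writes a list of documents to a single corpus file, one document per line with words separated by single spaces. The function name `write_corpus` and the idea of starting from an in-memory list of strings are assumptions made for the example:

```python
def write_corpus(documents, path):
    """Write one document per line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as handle:
        for doc in documents:
            # split() with no argument splits on any whitespace run, so
            # newlines and tabs inside a document are removed here
            tokens = doc.split()
            if tokens:
                handle.write(" ".join(tokens) + "\n")
```

For example, `write_corpus(["first document text", "second\ndocument"], "corpus.txt")` would produce a two-line file, because the newline inside the second document is collapsed into a space.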

score 3 · Answer 3 · answered Mar 01 '18 at 21:41

Your corpus should go in the CORPUS variable. vectors.txt is the output, which is the file you are supposed to use. You can train GloVe in Python, but it takes more time and you need a C compilation environment. I tried it before and wouldn't recommend it.
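To use vectors.txt from Python, a minimal sketch is to parse it directly, assuming the text output format in which each line holds a word followed by its space-separated vector components. The function name `load_glove_vectors` is hypothetical and no third-party library is required:

```python
def load_glove_vectors(path):
    """Parse a GloVe text file into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, values = parts[0], parts[1:]
            vectors[word] = [float(value) for value in values]
    return vectors
```

After `vectors = load_glove_vectors("vectors.txt")`, a lookup such as `vectors["the"]` returns that word's embedding, provided the word met the minimum count used during training.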

answered Mar 01 '18 at 21:41 · MLam

I down-voted this answer because it does not elaborate on why you do not recommend using Python, or if there are particular use cases where Python would be preferable. I think there are good reasons to go in either direction depending on the size of the corpus, the user's comfort level with Python, etc. – Matt L. Jul 30 '19 at 19:55

score 2 · Answer 4 · answered Jul 13 '18 at 06:10

Here is my take on this:

After cloning the repository, edit the demo.sh file: since you are training on your own corpus, replace the CORPUS value with your file's name.
Then remove the part of the script between make and CORPUS, as it only downloads an example corpus.
Then run make, which will create the four executables in the build folder.
Now run ./demo.sh, which will run every step in the script on your own corpus and write the output to vectors.txt.

Note: Don't forget to keep your corpus file directly inside the GloVe folder.
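For readers who want to see what the script does with the corpus, the four stages can be sketched in Python as below. This is only an outline under stated assumptions: the tool names and flags are written from memory of the stock demo.sh, and the parameter values (minimum count, window size, memory, iterations, threads) are illustrative defaults, so check them against your own copy of the script. The function names `glove_stages` and `run_stages` are hypothetical.

```python
import subprocess


def glove_stages(corpus="corpus.txt", build_dir="build", vector_size=50):
    """Describe the four stages as (argv, stdin_path, stdout_path) tuples.

    Flag values are assumptions; compare them with your demo.sh.
    """
    vocab = "vocab.txt"
    cooccur = "cooccurrence.bin"
    shuffled = "cooccurrence.shuf.bin"
    return [
        # 1. Count words in the corpus and write the vocabulary file
        ([f"{build_dir}/vocab_count", "-min-count", "5", "-verbose", "2"],
         corpus, vocab),
        # 2. Build the word-word co-occurrence counts from the corpus
        ([f"{build_dir}/cooccur", "-memory", "4.0", "-vocab-file", vocab,
          "-verbose", "2", "-window-size", "15"], corpus, cooccur),
        # 3. Shuffle the co-occurrence records
        ([f"{build_dir}/shuffle", "-memory", "4.0", "-verbose", "2"],
         cooccur, shuffled),
        # 4. Train the vectors; reads and writes files named by its flags
        ([f"{build_dir}/glove", "-save-file", "vectors", "-threads", "8",
          "-input-file", shuffled, "-x-max", "10", "-iter", "15",
          "-vector-size", str(vector_size), "-binary", "2",
          "-vocab-file", vocab, "-verbose", "2"], None, None),
    ]


def run_stages(stages):
    """Run each stage in order, wiring files to stdin/stdout like the script."""
    for argv, stdin_path, stdout_path in stages:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle:
                    handle.close()
```

Calling `run_stages(glove_stages("corpus.txt"))` from inside the GloVe folder, after `make` has built the tools, would mirror what ./demo.sh does. Note that the corpus is only ever read as input, while the vocabulary is written to a separate file.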
