Trouble loading GloVe 840B 300d vectors

Question

Each line of the file appears to have the format 'word number number ...', so it should be easy to split. I split the lines with the script below:

import numpy as np
def loadGloveModel(gloveFile):
    print "Loading Glove Model"
    f = open(gloveFile,'r')
    model = {}
    for line in f:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print "Done.",len(model)," words loaded!"
    return model

When I load glove.840B.300d.txt I get an error. Printing splitLine for the failing lines gives:

['contact', 'name@domain.com', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794' ... ]

or

['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', ...]

Note that this script works fine with the glove.6B.* files.

Looks like a problem with the downloaded file. See this answer as an example - https://stackoverflow.com/a/47758616/712995 — Maxim, Mar 03 '18 at 12:54
Actually, I found all of the lines that cause the error. Apart from '.'*n, the first tokens of the others are `['in', 'emailing', 'Email', 'email', 'At', 'at', 'by', 'to', 'in', 'or', '•', 'Contact','contact', 'is', 'on']` — Linjie Xu, Mar 03 '18 at 13:36
Could you please tell me the size of your file, either zipped or as plain txt? — Linjie Xu, Mar 03 '18 at 17:47
Do you have the glove.840B version? This script works fine with the 6B version. — Linjie Xu, Mar 03 '18 at 17:52

Weikai · Answer 1 · 2018-05-10T12:47:35.780

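The code works fine for the files glove.6B.*d.txt and glove.42B.*d.txt, but not for glove.840B.300d.txt. This is because glove.840B.300d.txt contains words with whitespace inside them. For example, it has a word like '. . .', with whitespace between the dots. split() with no argument splits on any whitespace, so it breaks such a word into several pieces, whereas split(' ') splits only on the ASCII space character. I solved the problem by changing this line: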

splitLine = line.split()

into

splitLine = line.split(' ')

So your code should look like this:

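import numpy as np

def loadGloveModel(gloveFile):
    # open(..., encoding=...) requires Python 3, so print is called as a function
    print("Loading Glove Model")
    model = {}
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            # Split on the ASCII space only, so other whitespace inside a word is kept
            splitLine = line.split(' ')
            word = splitLine[0]
            embedding = np.asarray(splitLine[1:], dtype='float32')
            model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model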
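
To see why the two calls can give different results, here is a minimal sketch on a made-up line (not taken from the real file). It assumes Python 3, and it assumes the whitespace inside the word is a non-breaking space (U+00A0). Which character glove.840B.300d.txt actually uses is not confirmed here.

```python
# Made-up sample line: the word contains a non-breaking space (U+00A0),
# an assumption about the file, followed by a 3-dimensional vector
# whose fields are separated by plain ASCII spaces.
line = ".\u00a0. 0.1 0.2 0.3\n"

# No argument: splits on any Unicode whitespace, so the word is broken in two.
default_split = line.split()

# Explicit ' ': splits on the ASCII space only, so the word stays intact.
space_split = line.rstrip("\n").split(" ")

print(default_split)
print(space_split)
```

If the whitespace inside the words were ordinary spaces, split(' ') would break them apart as well, so this fix depends on that assumption.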

score 1 · Answer 2 · answered Feb 14 '19 at 06:37

I think the following may help:

import numpy as np

def process_glove_line(line, dim):
    word = None
    embedding = None

    try:
        splitLine = line.split()
        # The last `dim` fields are the vector; everything before them is the
        # word, which may itself contain whitespace.
        word = " ".join(splitLine[:len(splitLine)-dim])
        embedding = np.array([float(val) for val in splitLine[-dim:]])
    except ValueError:
        # A field in the vector part could not be parsed as a float
        print(line)

    return word, embedding

def load_glove_model(glove_filepath, dim):
    model = {}
    with open(glove_filepath, encoding="utf8") as f:
        # Iterate over the file lazily rather than reading every line into memory
        for line in f:
            word, embedding = process_glove_line(line, dim)
            if embedding is not None:
                model[word] = embedding
    return model

model = load_glove_model("glove.840B.300d.txt", 300)
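
A variant of the same idea, as a rough sketch: split from the right exactly dim times, so whatever remains on the left is the word, spaces included. Here parse_glove_line is a hypothetical helper name, the sample line is made up, and the sketch assumes the fields are separated by single ASCII spaces. It returns a plain list of floats; wrap it in np.array as in the code above if you need an array.

```python
def parse_glove_line(line, dim):
    """Hypothetical helper: return (word, vector) for one line of a GloVe text file."""
    # Split at most `dim` times starting from the right; the leftover piece
    # on the left is the word, even if it contains spaces itself.
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError("expected a word followed by %d numbers" % dim)
    word = parts[0]
    vector = [float(val) for val in parts[1:]]
    return word, vector


# Made-up 3-dimensional example; the real file would use dim=300.
word, vector = parse_glove_line(". . . 0.5 -1.0 2.0\n", 3)
print(repr(word), vector)
```

Unlike joining the leading fields back together with " ", this keeps the word exactly as it appears in the file.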
