I'm trying to read a GloVe file, glove.twitter.27B.200d.txt, with the following function:

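def glove_reader(glove_file):
    # Map each token to its embedding vector (a list of floats).
    glove_dict = {}
    # The file handle is named f so it does not shadow the function name.
    with open(glove_file, 'rt', encoding='utf-8') as f:
        for line in f:
            # First field is the token; the remaining fields are the vector.
            tokens = line.rstrip().split()
            vect = [float(token) for token in tokens[1:]]
            glove_dict[tokens[0]] = vect
    return glove_dict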

The problem is that I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 0: invalid continuation byte

I also tried latin-1, but it didn't work either. It raises a different error: ValueError: could not convert string to float: 'Ù\x86'

I also tried replacing 'rt' with 'r' and 'rb'. I think it is a macOS problem, because the same code did not raise this error on Windows. Can someone please help me understand why I can't read this file?

Luis Miguel
  • I'm not familiar with GloVe, but you might want to confirm the file's encoding with `file glove.twitter.27B.200d.txt`, because it doesn't seem to be UTF-8. – wjandrea Sep 28 '19 at 02:34
  • @Luis Miguel It may be helpful to include an example GloVe file which is valid and that triggers the error mentioned above. – HumbleOne Sep 28 '19 at 02:35
  • https://stackoverflow.com/questions/41272247/unicodedecodeerror-utf8-codec-cant-decode-byte-0xea – AidanGawronski Sep 28 '19 at 02:36
  • Can you retest `open(glove_file, 'r', encoding='latin-1')` and report any errors it gives? It definitely should not return a utf-8 error. – whydoubt Sep 28 '19 at 02:43
  • @whydoubt I updated the question with the error. – Luis Miguel Sep 28 '19 at 03:02
  • AFAICT, the file does contain proper utf-8 data (unless your copy got altered somehow). Do you get a traceback with the error that can tell you which line the failure occurs at? – whydoubt Sep 28 '19 at 05:00
  • If you use `open(glove_file, 'rb')` you should not get any encoding-related errors, as everything should be done as `bytes`. If your keys in `glove_dict` need to be of type `str`, then you will need to perform a decode at that point: `glove_dict[tokens[0].decode()] = vect`. If that causes an error, I would wrap it with try/except and print tokens[0] on exception to see what bytes sequence is causing it to fail. – whydoubt Sep 29 '19 at 17:06
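
The suggestion in the last comment (read in `'rb'` mode, decode only the key, and wrap the decode in try/except) could be sketched roughly as follows. This is only an illustration, not a confirmed fix for this file: the names `glove_reader_bytes` and `bad_tokens` are made up for the example, and it assumes one token followed by space-separated floats per line. One side effect worth noting is that `bytes.split()` splits only on ASCII whitespace, whereas `str.split()` also splits on characters such as `'\x85'`; whether that explains the latin-1 error above is a guess.

```python
def glove_reader_bytes(glove_file):
    """Read a GloVe text file as bytes and decode only the token (hypothetical helper)."""
    glove_dict = {}
    bad_tokens = []  # raw byte sequences that could not be decoded as UTF-8
    with open(glove_file, 'rb') as f:
        for line in f:
            # On bytes, rstrip() and split() only treat ASCII whitespace as separators.
            tokens = line.rstrip().split()
            if len(tokens) < 2:
                continue  # skip empty or malformed lines
            try:
                word = tokens[0].decode('utf-8')
            except UnicodeDecodeError:
                # Keep the offending bytes so they can be inspected afterwards.
                bad_tokens.append(tokens[0])
                continue
            # float() accepts ASCII digits given as bytes.
            glove_dict[word] = [float(token) for token in tokens[1:]]
    return glove_dict, bad_tokens
```

Calling `glove_dict, bad_tokens = glove_reader_bytes('glove.twitter.27B.200d.txt')` and then printing `bad_tokens` should show which byte sequences, if any, fail to decode, which is the information the comments above are asking for.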
