
I am trying to parse the GloVe 6B 50d data from Kaggle in Google Colab, then run it through the word2vec conversion process (apologies for the huge URL below; it's the fastest link I've found). However, I'm hitting a bug where '-' tokens are not parsed correctly, resulting in the error shown below.

I have attempted to handle this in a few ways. I've looked into the load_word2vec_format method itself and tried to ignore errors, but that doesn't seem to make a difference. I've also tried a `map` call on the parsed DataFrame (the line between `read_fwf` and `savetxt` below), following combinations of advice from these links: [a] and [b]. This hasn't fixed or changed the error message received (i.e. removing the map call changes nothing).
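For reference, the error-ignoring attempt was along these lines (a sketch of what I tried; `unicode_errors` is the relevant parameter of `load_word2vec_format`, and it only relaxes byte decoding, not float parsing):

from gensim.models import KeyedVectors

# Sketch: ignore undecodable bytes instead of raising. This didn't change
# the ValueError, which comes from float() parsing, not character decoding.
model = KeyedVectors.load_word2vec_format(
    "glove6b50d_word2vec.txt", binary=False, unicode_errors='ignore')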

import pandas as pd
import numpy

# Read the raw GloVe file into a single-column DataFrame.
gloveFile = pd.read_fwf("https://storage.googleapis.com/kagglesdsdata/datasets/652874/1154868/glove.6B.50d.txt?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1589683535&Signature=kaS%2FTkSmvp7lhqwLJ%2B1lyuvP76PcDpwK1dnsCZEO0AiVXqQm7jsBc1r5g9af%2BuVkOSvMgqUDXYL4O%2BN43pnL5RLs7ns%2B3w%2BEtCYDTfJz6q1O0zfPz4%2BTcD3GV7UAGgVjVNIvncC9fHWcd2YuKwiZaTvKL%2BGRnMkf9b%2BYnOweYeXEeA1sX005krj%2FLMBbVTXmDTwOtN4HwVNb3%2BrbezkWkoEC6sxLPnGcsEKaBe%2Biv%2FuVSQG5FsQlwvRgsSU%2FMgk0c4bi%2FHxF04lrQW0E0s767TIXwHeodRHYpk5KQeKmyd91uKD2Zb8v8xQcf2%2BkmSNGQHbX0mDz8HBwYEmOdV7aMQ%3D%3D&response-content-disposition=attachment%3B+filename%3Dglove.6B.50d.txt",
                    delimiter="\n\t\s+", header=None)

# Intended to strip non-ASCII characters; kept in for completeness (see below).
map(lambda gloveFile: gloveFile.replace(r'[^\x00-\x7F]+' , '-'), gloveFile[0])

# Write the DataFrame back out as plain text inside gensim's test-data folder.
numpy.savetxt(r'/usr/local/lib/python3.6/dist-packages/gensim/test/test_data/glove6b50d.txt', gloveFile.values, fmt="%s")
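As an aside, the `map` call above can have no effect as written: its lazy result is discarded, and `str.replace` on a plain string treats the pattern as a literal substring, not a regex. An effective in-place version would be the following sketch (I've left the original line in for completeness, since removing it changes nothing):

# Sketch: Series.str.replace applies the pattern as a regex and returns a
# new Series, which must be assigned back for the change to take effect.
gloveFile[0] = gloveFile[0].str.replace(r'[^\x00-\x7F]+', '-', regex=True)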

from gensim.models import KeyedVectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

# Locate the file written above in gensim's test-data folder.
glove_file = datapath('glove6b50d.txt')

# Prepend the word2vec header ("<vocab_size> <vector_size>") to the GloVe text.
glove2word2vec(glove_file, "glove6b50d_word2vec.txt")

# This is the call that raises the ValueError below.
model = KeyedVectors.load_word2vec_format("glove6b50d_word2vec.txt", binary=False)

Per the comment below, the exact error I'm getting is as follows:

/usr/local/lib/python3.6/dist-packages/smart_open/smart_open_lib.py:253: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-132-6ad5a51f4fb3> in <module>()
      9 glove2word2vec(glove_file, "glove6b50d_word2vec.txt")
     10 
---> 11 model = KeyedVectors.load_word2vec_format("glove6b50d_word2vec.txt", binary=False)
     12 

2 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in <listcomp>(.0)
    220                 if len(parts) != vector_size + 1:
    221                     raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
--> 222                 word, weights = parts[0], [datatype(x) for x in parts[1:]]
    223                 add_word(word, weights)
    224     if result.vectors.shape[0] != len(result.vocab):

ValueError: could not convert string to float: '-'
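The last frame makes the failure concrete: on some line, one of the whitespace-separated fields that should be a float is a bare '-', and Python's float() rejects exactly that:

# A bare hyphen is not a valid float literal:
float('-')  # ValueError: could not convert string to float: '-'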

The system works fine using a text file containing only "test -1.0 1.526 -2.55" or "- -1.0 1.526 -2.55". Additionally, searching the source text file (glove.6B.50d.txt) for occurrences of " - " comes up with no results. I'm on Windows, so I searched by executing:

findstr /C:" - " glove.6B.50d.txt
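Note that findstr /C:" - " only matches a hyphen with a space on both sides, so it could miss a bad field at the very start or end of a line. A field-level scan in Python (a sketch, assuming a local copy of the source file) would be:

# Sketch: flag any line whose whitespace-separated field count isn't
# 51 (word + 50 floats), or whose numeric fields contain a bare '-'.
with open("glove.6B.50d.txt", encoding="utf8") as f:
    for line_no, line in enumerate(f, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) != 51 or '-' in parts[1:]:
            print(line_no, repr(line[:80]))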

Calling print(gloveFile) both pre- and post-map provides the following output. Note that I've kept the mapping call in for completeness of my efforts, not for its effect.

0       the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.0...
1       , 0.013441 0.23682 -0.16899 0.40951 0.63812 0....
2       . 0.15164 0.30177 -0.16763 0.17684 0.31719 0.3...
3       of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.7...
4       to 0.68047 -0.039263 0.30186 -0.17792 0.42962 ...
...                                                   ...
399995  chanty 0.23204 0.025672 -0.70699 -0.045465 0.1...
399996  kronik -0.60921 -0.67218 0.23521 -0.11195 -0.4...
399997  rolonda -0.51181 0.058706 1.0913 -0.55163 -0.1...
399998  zsombor -0.75898 -0.47426 0.4737 0.7725 -0.780...
399999  andberger 0.072617 -0.51393 0.4728 -0.52202 -0...

If I print the first ten lines of the glove6b50d_word2vec.txt file, I get the following text, which matches the word2vec format. Additionally, if I count the occurrences of the string " - " in the document, I find none.

['400000 50\n', 'the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581\n', ', 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392\n', '. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353 0.59868 0.28825 -0.11547 -0.041848 -0.67989 -0.25063 0.18472 0.086876 0.46582 0.015035 0.043474 -1.4671 -0.30384 -0.023441 0.30589 -0.21785 3.746 0.0042284 -0.18436 -0.46209 0.098329 -0.11907 0.23919 0.1161 0.41705 0.056763 -6.3681e-05 0.068987 0.087939 -0.10285 -0.13931 0.22314 -0.080803 -0.35652 0.016413 0.10216\n', 'of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375\n', 'to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044\n', 'and 0.26818 0.14346 -0.27877 0.016257 0.11384 0.69923 -0.51332 -0.47368 -0.33075 -0.13834 0.2702 0.30938 -0.45012 -0.4127 -0.09932 0.038085 0.029749 0.10076 -0.25058 -0.51818 0.34558 0.44922 0.48791 -0.080866 -0.10121 -1.3777 -0.10866 -0.23201 0.012839 -0.46508 3.8463 0.31362 0.13643 -0.52244 0.3302 0.33707 -0.35601 0.32431 0.12041 0.3512 -0.069043 0.36885 0.25168 -0.24517 0.25381 0.1367 -0.31178 -0.6321 -0.25028 -0.38097\n', 'in 0.33042 0.24995 -0.60874 0.10923 0.036372 0.151 -0.55083 -0.074239 -0.092307 -0.32821 0.09598 -0.82269 -0.36717 -0.67009 0.42909 0.016496 -0.23573 0.12864 -1.0953 0.43334 0.57067 -0.1036 0.20422 0.078308 -0.42795 -1.7984 -0.27865 0.11954 -0.12689 0.031744 3.8631 -0.17786 -0.082434 -0.62698 0.26497 -0.057185 -0.073521 0.46103 0.30862 0.12498 -0.48609 -0.0080272 0.031184 -0.36576 -0.42699 0.42164 -0.11666 -0.50703 -0.027273 -0.53285\n', 'a 0.21705 0.46515 -0.46757 0.10082 1.0135 0.74845 -0.53104 -0.26256 0.16812 0.13182 -0.24909 -0.44185 -0.21739 0.51004 0.13448 -0.43141 -0.03123 0.20674 -0.78138 -0.20148 -0.097401 0.16088 -0.61836 -0.18504 -0.12461 -2.2526 -0.22321 0.5043 0.32257 0.15313 3.9636 -0.71365 -0.67012 0.28388 0.21738 0.14433 0.25926 0.23434 0.4274 -0.44451 0.13813 0.36973 -0.64289 0.024142 -0.039315 -0.26037 0.12017 -0.043782 0.41013 0.1796\n', '" 0.25769 0.45629 
-0.76974 -0.37679 0.59272 -0.063527 0.20545 -0.57385 -0.29009 -0.13662 0.32728 1.4719 -0.73681 -0.12036 0.71354 -0.46098 0.65248 0.48887 -0.51558 0.039951 -0.34307 -0.014087 0.86488 0.3546 0.7999 -1.4995 -1.8153 0.41128 0.23921 -0.43139 3.6623 -0.79834 -0.54538 0.16943 -0.82017 -0.3461 0.69495 -1.2256 -0.17992 -0.057474 0.030498 -0.39543 -0.38515 -1.0002 0.087599 -0.31009 -0.34677 -0.31438 0.75004 0.97065\n']

My search methods are evidently thus far ineffective. Would really appreciate some help.

  • What's the exact error message(s) you're getting? (If a Python exception, with full error stack identifying involved lines of code.) Do you have similar problems with smaller files of the same format? – gojomo May 14 '20 at 05:15
  • Added the additional information to the main question. Thanks for your help. – Jordan MacLachlan May 14 '20 at 05:53
  • Thanks! What error were you getting before attempting the `...map...` line to patch some characters yourself? What do the 1st few lines of the original GLoVe file, and your patched version, look like? (eg: output of `head glove.6B.50d.txt` & `head glove6b50d.txt`.) The error is a pretty strong indication that a `'-'` is appearing where a float dimension is expected. Are any lines shown by `fgrep " - " glove6b50d.txt`? – gojomo May 14 '20 at 17:37
  • Hello! I have just (a) linked to the Kaggle-hosted data to show you the structure of the original text file, (b) added detail around the effect of the attempt to change the text on line two and (c) added detail around the output of `findstr`. Really appreciate your efforts, here! – Jordan MacLachlan May 14 '20 at 22:03
  • Thanks. If the `map` step doesn't change the error at all, I wouldn't do it: the minimum triggering example, and exact original error, before patch-arounds that resulted in no improvement, is most interesting. Is there a chance your download was truncated? It'd be interesting to see the 1st 5 lines and last 5 lines of whatever file you're passing to `load_word2vec_format()` (using whatever the Windows equivalents of `head`/`tail` might be). – gojomo May 15 '20 at 00:01
  • It is interesting, ae! I've added a block showing the result of `print(gloveFile)` on the question, which matches the text file (sorry, I excluded this originally to avoid this becoming a huge post). I think the error is somewhere in the transition between text --> csv --> text. – Jordan MacLachlan May 15 '20 at 00:18
  • Thanks, but: that doesn't look like the file you'd pass to `load_word2vec_format()` - because it doesn't have the prepended data from the `glove2word2vec()` call. – gojomo May 15 '20 at 00:31
  • Also, I've downloaded the Kaggle file, and unzipped its contents, and my `glove.6B.50d.txt` file is exactly `171350079` bytes long, and has MD5 `0fac3659c38a4c0e9432fe603de60b12`. Do these match your `glove.6B.50d.txt` exactly? – gojomo May 15 '20 at 00:35
  • Apologies - I misunderstood your original request. I've added the additional information. Yes, the MD5 and bytes are identical. – Jordan MacLachlan May 15 '20 at 03:25

1 Answer


I can't reproduce the problem running the following code (on a Linux machine, Python 3.6):

In [1]: from gensim.models import KeyedVectors 

In [2]: from gensim.scripts.glove2word2vec import glove2word2vec 

In [3]: glove2word2vec('glove.6B.50d.txt', 'glove.6B.50d.w2v.txt')
Out[3]: (400000, 50)

In [4]: model = KeyedVectors.load_word2vec_format('glove.6B.50d.w2v.txt')

In [5]: len(model)                                                                                               
Out[5]: 400000

In [6]: model['the']
Out[6]: 
array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],
      dtype=float32)

Do these exact lines trigger the exact same error as originally reported for you? (If you still get an error, but the error is even the slightest bit different, can you add the updated error to your question?)

My best guess, if you're still having a problem, is some Windows-specific default-encoding mangling during one of the steps, or the file having been opened and re-saved in some other editor.
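On Colab specifically, one way to keep the steps minimal would be something like the following sketch (not from the original answer; the placeholder stands for the full signed Kaggle URL from the question):

import urllib.request
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Sketch: download the raw GloVe file and convert it directly, skipping
# the pandas read_fwf/savetxt round-trip entirely.
glove_url = "..."  # the full signed Kaggle URL from the question
urllib.request.urlretrieve(glove_url, "glove.6B.50d.txt")
glove2word2vec("glove.6B.50d.txt", "glove.6B.50d.w2v.txt")
model = KeyedVectors.load_word2vec_format("glove.6B.50d.w2v.txt", binary=False)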

gojomo
  • This assumes I'm running the code locally. The main issue here is that I'm trying to run things on Google Colab, which means I have to import the data. i.e. this code won't work on Colab. – Jordan MacLachlan May 15 '20 at 23:13
  • So the problem only happens on Google Colab, and can't be reproduced locally? (If so, that would be very important information! Also, any of your reports of file sizes/MD5s/example-lines from your local Windows file might not match the file on Colab!) You could try these 4 lines of code locally, or you could put these same 4 lines of code in a `.py` file or Colab notebook & run it there - assuming the required `glove.6B.50d.txt` file can be made available there. – gojomo May 16 '20 at 00:18
  • I state in line one that I'm working in Colab... almost all of my reports are generated in Colab, save the MD5 key generation. – Jordan MacLachlan May 16 '20 at 00:42
  • OK, I was thrown off by the mention of Windows. You should see if you get the same error locally, & try the exact 4 lines above both locally & on Colab to see where, if anywhere, they throw any specific errors. And check the MD5 of the exact file used on Colab (see the sketch below). The main error you've reported **isn't** one that your `...map...` patching would have been expected to fix - the error doesn't implicate non-ascii characters at all. But it **does** look like something that could result from misaimed attempts at patching the file. So starting fresh w/ minimal steps makes sense. – gojomo May 16 '20 at 04:17
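For reference, a quick way to run that size/MD5 check on the Colab side (a sketch; the expected values are the ones reported in the comments above):

import hashlib, os

# Sketch: verify the Colab copy against the known-good size and MD5
# (171350079 bytes, MD5 0fac3659c38a4c0e9432fe603de60b12).
print(os.path.getsize("glove.6B.50d.txt"))
md5 = hashlib.md5()
with open("glove.6B.50d.txt", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())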