
I have already trained a model with Word2vec in Python and saved the vectors (size = 300) for all the words in a vec.txt file. Now, given a word, I need to get its corresponding vector and do some calculations with it, but I do not know how to read the vectors from the txt file.
The following is part of vec.txt:

new -0.000113 0.000211 -0.000170 0.000346 -0.000251 -0.001012 0.001647 -0.001331 0.001267 0.000876 0.001243 -0.000600 -0.000667 -0.001241 0.001204 -0.000726 -0.001023 0.001476 -0.001380 0.000065 0.000145 0.001451 0.001275 0.001482 -0.001011 0.001131 0.001095 -0.001637 0.000289 -0.000846 0.001599 -0.001027 -0.000768 -0.000595 0.000825 0.000639 -0.001097 -0.001654 -0.000977 -0.000351 0.001410 0.001182 0.000318 -0.000454 -0.000622 0.000343 0.000508 -0.000258 0.001347 0.000362 0.000372 -0.000208 0.000896 0.001408 0.001412 -0.001566 0.001642 -0.000865 -0.000656 0.001095 -0.001503 -0.000483 0.000465 0.001352 0.000602 -0.000017 0.000011 0.001219 0.001363 0.001296 -0.000474 0.000718 -0.000544 0.000779 -0.001225 -0.001141 -0.001061 -0.000550 0.001446 0.000735 0.001267 0.001269 0.001115 0.001023 0.001564 -0.000947 0.000320 -0.001648 0.001605 -0.000900 -0.000734 -0.000344 0.000376 -0.001550 0.001241 0.000294 0.000207 -0.001420 0.000297 0.001122 0.000834 -0.001423 -0.001499 0.001060 0.000898 0.001609 -0.000512 -0.001185 -0.001648 0.001328 0.001620 0.001344 0.000160 0.000567 -0.001665 -0.000246 -0.000274 0.001234 0.000659 0.000144 -0.001370 0.001457 -0.000025 0.001117 0.000249 0.000137 -0.000048 -0.000527 -0.000428 0.000305 -0.001058 0.001374 0.000369 0.001588 0.000085 0.000749 -0.001584 0.000918 -0.001196 0.000424 0.000651 -0.001387 0.000815 -0.000959 0.001261 -0.001246 0.000258 -0.000887 0.001583 0.000102 -0.001337 0.000428 -0.000004 0.000131 0.000487 -0.001659 0.000093 0.001464 0.000356 -0.001479 -0.001217 -0.000626 0.001019 0.001179 -0.000599 0.000825 0.000858 -0.000841 0.000399 -0.001587 -0.000923 -0.000496 -0.000668 0.000567 0.001308 0.001042 -0.000676 0.001292 -0.001345 0.000113 0.000021 -0.000577 0.000292 0.001052 -0.001646 -0.001186 0.000184 0.000747 -0.001190 -0.001472 0.000535 0.000199 0.000522 -0.000229 -0.000277 -0.000136 0.001568 -0.000509 -0.000065 0.000305 0.001245 -0.001371 -0.001378 -0.000742 0.000411 -0.000461 0.001547 0.001272 0.001339 0.000181 -0.001335 
0.000257 -0.000001 0.001494 -0.001379 -0.000635 -0.001195 -0.001483 0.000744 -0.000203 0.000407 -0.000061 -0.001561 0.000239 0.000370 0.000227 -0.000043 -0.001377 -0.000961 -0.001038 0.001575 0.000618 0.000218 0.001260 0.000971 0.000572 0.001307 0.000362 -0.000844 -0.000281 0.000440 -0.001122 0.000097 0.001392 0.000427 0.000913 -0.000537 -0.000889 0.000799 -0.001422 0.001501 0.001130 -0.000633 -0.000747 0.001198 0.000235 0.001335 0.000273 -0.000906 -0.000551 0.000527 0.000900 -0.001294 0.000451 -0.001180 -0.001376 0.000287 0.001508 0.000068 0.000225 0.000504 0.000137 -0.001071 -0.001383 0.001414 -0.000946 0.001358 -0.001146 -0.000623 0.000656 0.001605 0.000519 0.000106 0.001341 -0.000560 -0.001359 0.000721 0.001653 -0.000643 0.000625 0.000133 -0.000321 0.001230 0.000046 -0.001030 0.000752 0.000108 0.001263 0.000562 0.001224

If I am given 'new', I need to get the 300 vector values corresponding to 'new' from the vec.txt file.

kvorobiev

1 Answer


You can read the file, split it at spaces, remove the first word ('new') and convert the resulting 300 strings to floats.

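# Assumes vec.txt holds a single word followed by its values.
with open('vec.txt') as f:
    # Read the whole file and split it on whitespace.
    tokens = f.read().split()
# Skip the first token (the word 'new') and convert the rest to floats.
numbers = [float(s) for s in tokens[1:]]
print(numbers)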
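
The snippet above treats everything after the first token as numbers, so it only works when vec.txt holds a single word. If the file holds many words, one possible approach is to read it as a flat stream of records, each made of a word followed by a fixed number of values. The sketch below is only an illustration under assumptions that are not confirmed by the question: the file has no header line, every word has exactly `dim` values, and the file is UTF-8 encoded. The function name `load_vectors` and its `dim` parameter are hypothetical.

```python
import io


def load_vectors(path, dim=300):
    """Return a dict mapping each word to its list of floats.

    Assumptions (not confirmed by the question): the file is a flat,
    whitespace-separated sequence of "word v1 ... v_dim" records with no
    header line. Line breaks inside a record are ignored.
    """
    with io.open(path, encoding="utf-8") as f:
        tokens = f.read().split()

    record = dim + 1  # one word plus dim values
    if len(tokens) % record != 0:
        raise ValueError("file does not contain whole records of %d values" % dim)

    vectors = {}
    for start in range(0, len(tokens), record):
        word = tokens[start]
        vectors[word] = [float(t) for t in tokens[start + 1:start + record]]
    return vectors
```

With this in place, `load_vectors('vec.txt')['new']` would give the list of floats for 'new'. The whole file is read into memory at once, which is simple but may not suit very large vocabularies.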
Jakube
  • Hi, may I ask you a question? I am not able to post new questions here, I can only add comments, so I would appreciate it if you could answer. I have a list k: k = [[u'\u6c34\u679c'], [u'\u996e\u54c1'], [], [u'\u725b\u5c0f\u6392'], [u'\u9999\u828b'], [u'\u6155\u65af'], [u'\u5feb\u9910'], [u'\u6930\u5b50' u'\u571f\u53f8']]. Do you know how to remove the empty "[]" element from k? As you can see, there is one empty [] in the list. –  Aug 21 '15 at 03:07
  • `filtered = [item for item in k if k]` – Jakube Aug 21 '15 at 04:31
  • filter = [item for item in k if k=='']? I need to get back k = [[u'\u6c34\u679c'], [u'\u996e\u54c1'], [u'\u725b\u5c0f\u6392'], [u'\u9999\u828b']], with the [] removed. –  Aug 21 '15 at 05:36
  • @ArrayNo1 Sorry, I made a mistake. It should be: `filtered = [item for item in k if item]`. Check it out here: https://ideone.com/NX9awX – Jakube Aug 21 '15 at 06:22
  • Btw, these things are called list comprehensions. You should read about them. They are super easy and very useful. – Jakube Aug 21 '15 at 06:22
  • thanks~~!! very much –  Aug 21 '15 at 08:03
  • That works, thanks a lot. I am now trying to sort k = [[[u'\u5496\u5561'], 0.045930384670007499], [[u'\u5de7\u514b\u529b'], -0.068261551870430676], [[u'\u6c34\u679c'], 0.0070263516632489802], [[u'\u6c34\u679c'], 0.0070263516632489802], [[u'\u6c34\u679c', u'\u86cb\u7cd5'], 0.030538705504949332], [[u'\u86cb\u7cd5'], 0.038854468196036926], [[u'\u5976\u6cb9'], -0.072265207412896187], [[u'\u5976\u6cb9'], -0.072265207412896187]] by the second element of each item (the number), but it is not working. When I type print(k[1]), it prints the second element of the list. I thought it –  Aug 21 '15 at 08:26
  • I thought it should print all the numbers in the list, not [[u'\u5de7\u514b\u529b'], -0.068261551870430676]. –  Aug 21 '15 at 08:27
  • No, `print k[1]` only prints the second element in the list. `print k[0][1]`, `print k[1][1]`, ... gives you the things you want. And if you want the second element of each element in `k`, you can do this again with list comprehensions: `seconds_elements = [item[1] for item in k]`. – Jakube Aug 21 '15 at 08:34
  • thanks ! Now I know it. it works too –  Aug 21 '15 at 08:51
  • So sorry, may I ask one last question? I get UnicodeEncodeError: 'gbk' codec can't encode character u'\ue056' in position 0: illegal multibyte sequence. It looks like an encoding problem. –  Aug 21 '15 at 09:29
  • @ArrayNo1 no idea, maybe try something like this: http://stackoverflow.com/questions/3218014/unicodeencodeerror-gbk-codec-cant-encode-character-illegal-multibyte-sequen – Jakube Aug 21 '15 at 09:34
  • Thanks for help and time~ –  Aug 21 '15 at 09:45
  • Hi, thanks for all your help. I was wondering if you know about this situation: I use the trained model to test data, and I have made sure the input data is fine, but the program keeps running and never stops. It shows no error saying anything is wrong, and nothing is written to the output file either. –  Aug 22 '15 at 06:50
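
The list operations discussed in the comments can be sketched together as follows. The filtering and second-element extraction follow the list comprehensions given in the thread. Sorting by the score with `sorted` and a `key` function is not something the thread itself shows; it is one common way to do what the commenter was attempting. The variable names (`non_empty`, `scored`, `scores`, `by_score`) are illustrative only, and the data is a shortened copy of the lists quoted above.

```python
# Shortened copy of the first list from the comments, including the empty [].
k = [[u'\u6c34\u679c'], [u'\u996e\u54c1'], [], [u'\u725b\u5c0f\u6392']]

# An empty list is falsy, so "if item" keeps only the non-empty sublists.
non_empty = [item for item in k if item]

# Shortened copy of the second list: each item is [words, score].
scored = [
    [[u'\u5496\u5561'], 0.045930384670007499],
    [[u'\u5de7\u514b\u529b'], -0.068261551870430676],
    [[u'\u6c34\u679c'], 0.0070263516632489802],
]

# Collect the second element (the score) of every item.
scores = [item[1] for item in scored]

# Sort the items by their score, lowest first, leaving `scored` unchanged.
by_score = sorted(scored, key=lambda item: item[1])
```

Passing `reverse=True` to `sorted` would order the items from the highest score to the lowest instead.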