0

I have a large text file (https://int-emb-word2vec-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt) and loaded it into a dictionary:

import csv

import numpy as np

word2vec = "./vectors.txt"

with open(word2vec, 'r') as f:
    file = csv.reader(f, delimiter=' ')
    model = {k: np.array(list(map(float, v))) for k, *v in file}

So I now have a dictionary of the form {word: embedding vector}.

Now I want to convert my keys from b'Word' to Word (so that I get, for example, UNK instead of b'UNK').

Does anyone know how I can remove the b'...' from every key? Or would it be easier to strip all the b'...' from the text file before loading it into a dictionary?

AMC
Maxl Gemeinderat

3 Answers

0

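Why not just call .decode() on the key? Note that this only works if the keys are real bytes objects; in Python 3, str has no .decode() method.

The line would be: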

model = {k.decode(): np.array(list(map(float, v))) for k, *v in file}
aaron
0

It's not possible to change a dictionary's keys in place. You need to either add a new key that holds the same value and then remove the old one, or create a new dict, for example with a dict comprehension.
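A minimal sketch of both approaches, using a small made-up dictionary. The sample data and the clean() helper are illustrative assumptions, not part of the question's code, and the plain slicing in clean() does not handle escaped non-ASCII characters:

```python
# Hypothetical sample data standing in for the real embeddings.
model = {"b'UNK'": [0.1, 0.2], "b'Haus'": [0.3, 0.4]}


def clean(key):
    # Hypothetical helper: drop a leading b' and a trailing ' if both are present.
    if key.startswith("b'") and key.endswith("'"):
        return key[2:-1]
    return key


# Approach 1: build a new dict with a dict comprehension.
renamed = {clean(k): v for k, v in model.items()}

# Approach 2: add the new key and remove the old one.
# Iterate over a snapshot of the keys, because the dict is modified in the loop.
for old_key in list(model):
    model[clean(old_key)] = model.pop(old_key)
```

Both variants end up with the keys UNK and Haus. The comprehension leaves the original dict untouched, while the pop() loop rewrites it in place.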

Sai prateek
-1

Now I want to convert my key from: b'Word' to: Word (so that I got for example UNK instead of b'UNK').

The keys you get are strings like "b'Word'" and "b'UNK'", not the bytes objects b'Word' and b'UNK'. Try executing print(b"Word", type(b"Word"), "b'Word'", type("b'Word'")). It might make things clearer.

This should work:

import ast
import csv

import numpy as np

with open("../out/out_file.txt") as file_in:
    reader = csv.reader(file_in, delimiter=" ")
    words = {ast.literal_eval(word).decode(): np.array(vect, dtype=np.float64) for word, *vect in reader}

This solution also appears to be much faster.
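To see what the key conversion does without downloading the file, here is a small sketch on made-up in-memory data. It assumes the file stores non-ASCII words as escaped bytes literals (the sample lines are invented, not copied from vectors.txt), and it uses plain lists of floats instead of NumPy arrays to stay dependency-free:

```python
import ast
import csv
import io

# Made-up sample in the assumed layout: a bytes literal, then the vector values.
sample = "b'UNK' 0.1 0.2\nb'f\\xc3\\xbcr' 0.3 0.4\n"

reader = csv.reader(io.StringIO(sample), delimiter=" ")
words = {
    # literal_eval turns the text "b'...'" into a real bytes object;
    # decode() (UTF-8 by default) then turns that into a str.
    ast.literal_eval(word).decode(): [float(x) for x in vect]
    for word, *vect in reader
}

print(words)  # {'UNK': [0.1, 0.2], 'für': [0.3, 0.4]}
```

Going through ast.literal_eval rather than slicing off the b'...' wrapper means escape sequences such as \xc3\xbc are resolved into the characters they encode.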

AMC