I am working with an nlp model and would like to normalize nlp.vocab.vectors. The spaCy documentation on vectors states that it is a NumPy ndarray.

I've googled a fair bit about normalizing numpy arrays as stated here, here and here.

As such, I tried the following three approaches:

import spacy
import numpy as np
nlp = spacy.load('en_core_web_lg')

matrix = nlp.vocab.vectors # Shape (514157, 300)

# Approach 1
matrix_norm1 = matrix/np.linalg.norm(matrix) 
print(matrix_norm1.shape) # Shape (514157,)

# Approach 2
#matrix_norm2 = matrix / np.sqrt(np.sum(matrix**2))
## Results in TypeError: unsupported operand type(s) for ** or pow(): 'spacy.vectors.Vectors' and 'int'

# Approach 3
matrix_norm3 = matrix / (np.mean(matrix) - np.std(matrix))
print(matrix_norm3.shape) # => Shape (514157,)

The two approaches that return a result do not retain the dimensions (514157, 300). Any suggestions on how I can do this?

OLGJ

1 Answer

nlp.vocab.vectors is a Vectors object. The numpy array is stored in nlp.vocab.vectors.data. See: https://spacy.io/api/vectors
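
Once you have the plain array, a row-wise normalization keeps the (n_words, n_dims) shape. The snippet below is only a minimal sketch, assuming that "normalize" means scaling each row to unit L2 length. The helper name `normalize_rows` and the small `toy` array are hypothetical stand-ins; in practice you would pass a copy of nlp.vocab.vectors.data rather than overwrite the table in place.

```python
import numpy as np

def normalize_rows(data: np.ndarray) -> np.ndarray:
    """Return a new array in which every row of `data` has unit L2 norm."""
    # axis=1 computes one norm per row; keepdims=True keeps shape (n, 1)
    # so the division broadcasts across each row's columns.
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    # All-zero rows have norm 0; divide them by 1 so they stay zero.
    norms[norms == 0] = 1.0
    return data / norms

# Hypothetical stand-in for nlp.vocab.vectors.data
toy = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]], dtype=np.float32)
unit = normalize_rows(toy)
print(unit.shape)  # same shape as the input
```

By contrast, np.linalg.norm(matrix) without an axis argument returns a single scalar for the whole table, which is why an explicit axis=1 is used here.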

aab
  • So how can I normalize the `nlp.vocab.vectors`? – OLGJ Nov 30 '22 at 12:25
  • I'm not sure what you're trying to do? It's fine if you're just extracting the vectors for use elsewhere, but if you modify the vector table in `nlp.vocab.vectors.data` it will break all of the statistical components in the `en_core_web_lg` pipeline. – aab Nov 30 '22 at 13:07
  • I'm trying to achieve what they did [here](https://aclanthology.org/N13-1090.pdf) in section 5 – OLGJ Nov 30 '22 at 15:22
  • You can extract the whole vector table into word2vec format as shown here: https://github.com/explosion/spaCy/discussions/6061#discussioncomment-189609. Then you can work with the vector table however you'd like. – aab Nov 30 '22 at 17:16