
I am very confused about how word vectors work, specifically in regards to spacy's entity linking (https://spacy.io/usage/training#entity-linker).

When adding an entity to a knowledge base, one of the parameters is the entity_vector. How do you get this? I have tried doing

import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load('en_core_web_sm')
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=96)
for n in my_entities:
    kb.add_entity(entity=n, freq=___, entity_vector=nlp(n).vector)

The `nlp(n).vector` call gives me vectors of length 96, so that's what I use for entity_vector_length, although in the example they use 3. I am just wondering if my approach is okay, but I am kind of confused all around about this.
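One thing that has to hold regardless of where the vectors come from: the entity_vector_length passed to the KnowledgeBase must match the length of every vector added with add_entity. A minimal sketch of that invariant, using a deterministic numpy stand-in in place of `nlp(n).vector` (the stand-in function and the entity names are illustrative assumptions, not spaCy API):

```python
import numpy as np

ENTITY_VECTOR_LENGTH = 96  # must equal the KB's entity_vector_length


def fake_vector(name, length=ENTITY_VECTOR_LENGTH):
    # Deterministic stand-in for nlp(name).vector; en_core_web_sm
    # happens to produce 96-dimensional vectors, hence 96 above.
    rng = np.random.default_rng(sum(ord(c) for c in name))
    return rng.standard_normal(length).astype("float32")


my_entities = ["Q1", "Q2"]
vectors = {n: fake_vector(n) for n in my_entities}

# Every vector has the fixed size the KB expects.
assert all(v.shape == (ENTITY_VECTOR_LENGTH,) for v in vectors.values())
```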

xibalba1
formicaman

1 Answer


We'll have to document this better, but let me try to explain: the KnowledgeBase stores pretrained entity vectors. These vectors are condensed versions of the descriptions of the entities. While such a description can be one or multiple words (varying length), its vector should always have a fixed size. A length of 3 is unrealistic; something like 64 or 96 makes more sense. With that, each entity description is mapped into a 96-dimensional space, so that we can use these descriptions in further downstream neural networks.

As shown in the example you linked, you can use the EntityEncoder to create this mapping of a multi-word description to a 96D vector, and you can play around with the length of the embeddings. Larger embeddings mean that you can capture more information, but will also require more storage.

The creation of these embedding vectors for the entity descriptions is done as an offline step, once, when creating the KnowledgeBase. When you then actually want to train a neural network to do entity linking, the size of that network will depend on the size you've chosen for your description embeddings.

Intuitively, the "entity embeddings" are a sort of averaged, condensed version of the word vectors of all the words in the entity's description.
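That intuition can be sketched in a few lines: average the word vectors of a variable-length description to get one fixed-size embedding. The toy vocabulary and 4-dimensional vectors below are illustrative assumptions (in spaCy these would come from each token's vector), not the actual EntityEncoder:

```python
import numpy as np

# Toy word vectors; real ones would come from a trained model.
word_vectors = {
    "capital": np.array([0.1, 0.9, 0.0, 0.2]),
    "of":      np.array([0.0, 0.1, 0.0, 0.0]),
    "france":  np.array([0.8, 0.2, 0.5, 0.1]),
}


def description_embedding(description):
    """Average the word vectors of a variable-length description
    into a single fixed-size vector."""
    vecs = [word_vectors[w] for w in description.lower().split()]
    return np.mean(vecs, axis=0)


emb = description_embedding("capital of France")
assert emb.shape == (4,)  # fixed size, regardless of description length
```

However long the description, the output always has the same dimensionality, which is exactly the property the KnowledgeBase needs for its entity vectors.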

Also, I don't know if you've seen this, but if you're looking for a more realistic way of running the Entity Linking, you can check out the scripts for processing Wikipedia & Wikidata here.

iron9
Sofie VL
  • Thanks! But what if I am not doing anything with wikidata? For example, I just want to read entities from a csv as the entities in my knowledge base and then give them aliases. I was able to create a KB, and used "nlp(n).vector" for "entity_vector", where nlp = 'en_core_web_sm' and n is a name of an entity. Does this seem reasonable? – formicaman Jan 23 '20 at 18:40
  • Oh, that's an entirely different story. The underlying assumption of the current EL model is that a description of an entity will semantically be similar to the sentence/context in which an entity is used. So short answer: you need descriptions (and their respective embeddings) for the current EL algo to work. You can take those descriptions from Wikidata, or from somewhere else. – Sofie VL Jan 23 '20 at 20:22
  • Ok, so the descriptions are needed and the entity_vectors are vector representations of the descriptions, not the actual entities, correct? – formicaman Jan 23 '20 at 21:00
  • Thanks! One last question - in the wikipedia_pretrain_kb.py file, it has multiple lines such as "entity_defs_path = loc_entity_defs if loc_entity_defs else output_dir / ENTITY_DEFS_PATH"...this gives me an error saying you can't divide a string by a string. What exactly is this supposed to be doing? – formicaman Jan 24 '20 at 15:04
  • Ah, the "/" syntax is from pathlib, I find that an extremely intuitive library to define paths in a platform-independent way. The script is set up to expect Path variables and will parse them as such, but if you're only copy-pasting parts of it, you probably need to do something like `p = Path('yourlocation')` and then you can concatenate strings to it, and the resulting objects will also be `Path`'s. – Sofie VL Jan 24 '20 at 16:01
  • Thanks again! Didn't realize it needed to be a path object. The only problem I have to now deal with is the size of the dumps. – formicaman Jan 24 '20 at 16:27
  • I agree that the documentation lacks clarity in this regard. Also, the link to the example is dead. – iron9 May 22 '20 at 15:48
  • Thanks, fixed the link, the file was renamed. – Sofie VL May 23 '20 at 17:41
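The pathlib "/" syntax discussed in the comments above can be sketched as follows; the directory and file names here are illustrative, not the ones from the actual script:

```python
from pathlib import Path

# On Path objects, "/" joins path components in a
# platform-independent way - it is not string division.
output_dir = Path("output")
entity_defs_path = output_dir / "entity_defs.csv"

assert isinstance(entity_defs_path, Path)
assert entity_defs_path.name == "entity_defs.csv"
```

This is why the script fails when output_dir is a plain string: `"output" / "entity_defs.csv"` raises a TypeError, whereas wrapping the left-hand side in `Path(...)` makes the whole expression produce another Path.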