
I'm trying to understand how to prepare paragraphs for ELMo vectorization.

The docs only show how to embed multiple sentences/words at a time.

e.g.:

sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
elmo(
    inputs={
        "tokens": sentences,
        "sequence_len": [6, 5]
    },
    signature="tokens",
    as_dict=True
)["elmo"]

As I understand it, this will return 2 vectors, each representing a given sentence. How would I go about preparing input data to vectorize a whole paragraph containing multiple sentences? Note that I would like to use my own preprocessing.

Can this be done like so?

sentences = [["<s>" "the", "cat", "is", "on", "the", "mat", ".", "</s>", 
              "<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]

or maybe like so?

sentences = [["the", "cat", "is", "on", "the", "mat", ".", 
              "dogs", "are", "in", "the", "fog", "."]]

1 Answer


ELMo produces contextual word vectors, so the vector for a word is a function of both the word itself and the context (e.g., the sentence) it appears in.
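To make that concrete, here is a minimal sketch, assuming the tfhub.dev/google/elmo/2 module and a TF 1.x session (which the docs snippet above implies): the "elmo" output contains one 1024-dimensional vector per token, not one per sentence.

import tensorflow as tf
import tensorflow_hub as hub

# Sketch, assuming the tfhub.dev/google/elmo/2 module under TF 1.x.
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
embeddings = elmo(
    inputs={"tokens": sentences, "sequence_len": [6, 5]},
    signature="tokens",
    as_dict=True
)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # One 1024-d vector per token: (batch_size, max_length, 1024)
    print(sess.run(embeddings).shape)  # (2, 6, 1024)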

Like your example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens; of your two options, the second (without the <s> and </s> markers) is the right idea. To get this format, you could use the spaCy tokenizer:

import spacy

# You need to install the language model first; see the spaCy docs.
nlp = spacy.load('en_core_web_sm')

text = "The cat is on the mat. Dogs are in the fog."
doc = nlp(text)
# One token list per sentence: [['The', 'cat', ..., '.'], ['Dogs', ..., '.']]
sentences = [[w.text for w in s] for s in doc.sents]

One note on the docs snippet: the extra padding "" on the second sentence is actually needed to keep the tokens batch rectangular (shape [batch_size, max_length]); sequence_len is what tells the module to ignore those padding positions.
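A minimal sketch of that padding step, reusing the elmo module object from the snippet above (the variable names here are just illustrative):

# Pad the tokenized sentences to a rectangular batch and record the
# true lengths; sequence_len makes the module ignore the "" padding.
seq_lens = [len(s) for s in sentences]
max_len = max(seq_lens)
padded = [s + [""] * (max_len - len(s)) for s in sentences]

embeddings = elmo(
    inputs={"tokens": padded, "sequence_len": seq_lens},
    signature="tokens",
    as_dict=True
)["elmo"]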

Update:

As I understand, this will return 2 vectors each representing a given sentence

No, this will return a vector for each word in each sentence. If you want the whole paragraph to be the context for each word, just change it to:

sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]

and

...
"sequence_len": [11]
  • I am more interested in how ELMo consumes my input data (tokens). Can the whole paragraph be represented like my 2nd example? Will the sentences in the paragraph be correctly contextualized? Does ELMo even care if it receives a sentence or a whole paragraph? – tensa11 Dec 01 '18 at 20:31
  • Yes, from an engineering standpoint, the paragraph can be the context. But note it will consume much more memory and have trouble scaling for longer contexts. Whether this is a good idea or not, you may want to experiment. – al0 Dec 01 '18 at 20:33
  • You may want to use [allennlp](https://github.com/allenai/allennlp), which is written by the ELMo authors. I found it easier to use than the tf module. – al0 Dec 01 '18 at 20:35
  • I guess you should tokenize into sentences as you did. Then average all word vectors in a sentence. Then average all the sentence vectors. Thoughts? – Isbister Dec 06 '18 at 19:21