tokens = [The, wage, productivity, nexus, the, process, of, development,....]

I am trying to convert a list of tokens into their lemmatized form using spaCy's Lemmatizer, following its documentation.

My code:

from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
lookups = Lookups()
lookups.add_table("lemma_rules")
lemmatizer = Lemmatizer(lookups)
lemmas = []
for tokens in filtered_tokens:
    lemmas.append(lemmatizer(tokens))

Error message:

TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
      7 lemmas = []
      8 for tokens in filtered_tokens:
----> 9     lemmas.append(lemmatizer(tokens))

TypeError: __call__() missing 1 required positional argument: 'univ_pos'

From this discussion I understand how spaCy's Lemmatizer works in theory. However, I am not sure how to implement it.

How can I find out the univ_pos for my tokens?

  • [UPOS tags](https://universaldependencies.org/u/pos/) are things like NOUN, VERB,... Generally when you run spaCy you parse a sentence, which tags each word with these tags. The lemma functionality is then available in the `.lemma_` attribute. If you can't parse a full sentence you'll have to apply the tags manually. If your tokens are spaCy `Tokens` you should be able to just call `.lemma_` to get the lemma. – bivouac0 Feb 16 '20 at 22:19
  • When parsing, I am using spaCy's Tokenizer pipeline: https://spacy.io/api/tokenizer . Do you know if I can get the UPOS tags with the tokenizer pipeline? Thanks. – sheth7 Feb 16 '20 at 22:22

1 Answer


Here's an example adapted from the spaCy documentation...

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.lemma_)

Here `.pos_` gives you the Universal Dependencies part-of-speech tag you're looking for in your original question.

However, tagging, lemmatizing, etc. requires a full pipeline of components. There is a tagger component that adds the POS data. If the Tokenizer is the only function in your pipeline then you probably won't have the POS info.
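
If speed is the concern, here is a minimal sketch of one way to keep batch processing while still getting POS tags and lemmas (assuming your documents are plain strings and the same en_core_web_sm model; the disable list and batch size are illustrative, not from your posted code):

import spacy

# Keep the tagger (needed for POS tags and lemmas), drop parser/NER for speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

texts = ["The wage productivity nexus in the process of development", "another document"]

# nlp.pipe streams texts in batches, which is much faster than calling nlp() per text
for doc in nlp.pipe(texts, batch_size=1000):
    print([token.lemma_ for token in doc])

This way you never call the Lemmatizer directly, so you never need to supply `univ_pos` yourself.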

  • This does not perform the tokenization using the SpaCy tokenization pipeline and thus is not as fast. I would like to use the pipeline to tokenize and get the pos tags. Thoughts? – sheth7 Feb 16 '20 at 22:29
  • The above example uses the spaCy pipeline. I don't understand what you're asking. Can you explain your problem in more detail and state why the above example won't work? – bivouac0 Feb 16 '20 at 22:38
  • This method will work but it is not as fast as using the SpaCy Tokenizer pipeline. See this - spacy.io/api/tokenizer . The Tokenizer pipeline allows batch processing etc. – sheth7 Feb 16 '20 at 22:40
  • It's faster because it's not tagging (at least that's one reason). SpaCy's lemmatizer needs to know the UPOS, so if you're going to use it, you need to tag. You can look at [lemminflect.getAllLemmas](https://lemminflect.readthedocs.io/en/latest/lemmatizer/). Lemminflect uses a different approach (dictionary based) to lemmatize and might work a bit better for you. Note that a few words may return different lemmas if you don't know the tag, but you can correctly lemmatize most words without it. – bivouac0 Feb 16 '20 at 22:47
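
For reference, a minimal sketch of the dictionary-based lemminflect approach mentioned in the last comment (assumes lemminflect is installed; the example word and the commented outputs are illustrative):

from lemminflect import getAllLemmas, getLemma

# Without a POS tag, getAllLemmas returns every candidate lemma keyed by UPOS tag
print(getAllLemmas("watches"))           # e.g. {'NOUN': ('watch',), 'VERB': ('watch',)}

# With a UPOS tag, getLemma returns the lemma(s) for that part of speech only
print(getLemma("watches", upos="VERB"))  # e.g. ('watch',)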