Questions tagged [lemmatization]

Lemmatization in linguistics is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.

436 questions
202
votes
14 answers

What is the difference between lemmatization and stemming?

When do I use each? Also, is the NLTK lemmatization dependent on parts of speech? Wouldn't it be more accurate if it was?
TIMEX
  • 259,804
  • 351
  • 777
  • 1,080
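
To make the distinction in the question above concrete, here is a minimal toy sketch in plain Python. It is not NLTK's implementation: the suffix list, the mini-lexicon, and the function names `toy_stem` and `toy_lemmatize` are all hypothetical and chosen only for illustration. The stemmer chops suffixes and may return a non-word, while the lemmatizer looks up a dictionary form and can give different answers depending on the part of speech.

```python
# Toy contrast between stemming and lemmatization (illustrative only).

# Hypothetical, deliberately tiny suffix list for the toy stemmer.
SUFFIXES = ("ies", "ing", "es", "s")


def toy_stem(word: str) -> str:
    """Chop the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-lexicon keyed by (surface form, part of speech).
LEMMAS = {
    ("communities", "n"): "community",
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("cacti", "n"): "cactus",
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


if __name__ == "__main__":
    for w in ("communities", "running", "ran"):
        print(w, toy_stem(w), toy_lemmatize(w, "v" if w != "communities" else "n"))
```

In this sketch the default part of speech is noun, so `toy_lemmatize("running")` leaves the word unchanged while `toy_lemmatize("running", "v")` returns `run`. That mirrors the kind of POS dependence the question asks about, under the assumptions stated above.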
114
votes
22 answers

How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball, but neither works on all words, and both miss some very common ones. My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right. See also: Stemming…
manixrock
  • 2,533
  • 4
  • 24
  • 29
81
votes
4 answers

Stemmers vs Lemmatizers

Natural Language Processing (NLP), especially for English, has evolved into the stage where stemming would become an archaic technology if "perfect" lemmatizers exist. It's because stemmers change the surface form of a word/token into some…
alvas
  • 115,346
  • 109
  • 446
  • 738
77
votes
8 answers

wordnet lemmatization and pos tagging in python

I wanted to use wordnet lemmatizer in python and I have learnt that the default pos tag is NOUN and that it does not output the correct lemma for a verb, unless the pos tag is explicitly specified as VERB. My question is what is the best shot…
user1946217
  • 1,733
  • 6
  • 31
  • 40
39
votes
6 answers

How to use spacy's lemmatizer to get a word into basic form

I am new to spacy and I want to use its lemmatizer function, but I don't know how to use it. I want to pass in a string of words and get back the string with the basic form of each word. Examples: 'words' => 'word', 'did' => 'do'. Thank you.
yi wang
  • 403
  • 1
  • 4
  • 8
31
votes
2 answers

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this…
Luca Fiaschi
  • 3,145
  • 7
  • 31
  • 44
30
votes
6 answers

How to perform Lemmatization in R?

This question is a possible duplicate of Lemmatizer in R or python (am, are, is -> be?), but I'm adding it again since the previous one was closed saying it was too broad and the only answer it has is not efficient (as it accesses an external…
StrikeR
  • 1,598
  • 5
  • 18
  • 35
29
votes
5 answers

Lemmatize French text

I have some text in French that I need to process in some ways. For that, I need to: first, tokenize the text into words; then, lemmatize those words to avoid processing the same root more than once. As far as I can see, the wordnet lemmatizer in the…
yelsayed
  • 5,236
  • 3
  • 27
  • 38
23
votes
13 answers

How to turn plural words singular?

I'm preparing some table names for an ORM, and I want to turn plural table names into single entity names. My only problem is finding an algorithm that does it reliably. Here's what I'm doing right now: If a word ends with -ies, I replace the…
Dmitri Nesteruk
  • 23,067
  • 22
  • 97
  • 166
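
As a rough illustration of the suffix-rule approach described in the question above, here is a minimal sketch in plain Python. It is not the asker's algorithm (the excerpt is truncated) and not a reliable solution: the function name `singularize`, the word lists, and the rule order are all assumptions made for this example.

```python
# Naive rule-based singularizer for English table names (illustrative only).

# Hypothetical lookup tables; a real project would need far larger ones.
IRREGULAR = {"people": "person", "children": "child", "indices": "index"}
UNCHANGED = {"series", "species", "status", "news"}


def singularize(word: str) -> str:
    """Return a best-guess singular form using a few ordered suffix rules."""
    lower = word.lower()
    if lower in UNCHANGED:
        return word
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    # categories -> category
    if lower.endswith("ies") and len(lower) > 4:
        return word[:-3] + "y"
    # boxes -> box, addresses -> address, branches -> branch
    if lower.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]
    # users -> user, but leave class, status-like and analysis-like words alone
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


if __name__ == "__main__":
    for name in ("categories", "boxes", "addresses", "users", "people"):
        print(name, "->", singularize(name))
```

Heuristics like this misfire on some words (for example, the `-ies` rule turns "movies" into "movy"), which is why such code usually ends up needing an exception list or a dedicated inflection library.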
19
votes
2 answers

Is it possible to speed up Wordnet Lemmatizer?

I'm using the Wordnet Lemmatizer via NLTK on the Brown Corpus (to determine if the nouns in it are used more in their singular form or their plural form). i.e. from nltk.stem.wordnet import WordNetLemmatizer l = WordNetLemmatizer() I've noticed…
ess
  • 313
  • 5
  • 12
18
votes
2 answers

Sklearn: adding lemmatizer to CountVectorizer

I added lemmatization to my countvectorizer, as explained on this Sklearn page. from nltk import word_tokenize from nltk.stem import WordNetLemmatizer class LemmaTokenizer(object): def __init__(self): self.wnl =…
Rens
  • 492
  • 1
  • 5
  • 14
15
votes
3 answers

How does spacy's lemmatizer work?

For lemmatization, spacy has lists of words (adjectives, adverbs, verbs...) and also lists of exceptions (adverbs_irreg...); for the regular ones there is a set of rules. Let's take as an example the word "wider". As it is an adjective, the rule for…
Luis Ramon Ramirez Rodriguez
  • 9,591
  • 27
  • 102
  • 181
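
The scheme described in the question above (a word list per part of speech, an exception list for irregular forms, and suffix rules for regular ones) can be sketched as follows. This is a toy under stated assumptions, not spacy's actual code: the `INDEX`, `EXCEPTIONS`, and `RULES` tables and the `lemmatize` function are hypothetical and hold only a handful of adjectives.

```python
# Toy lookup-plus-rules lemmatizer (illustrative only, not spacy's source).

# Hypothetical data: known base forms, irregular forms, and suffix rules.
INDEX = {"adj": {"wide", "big", "good", "old"}}
EXCEPTIONS = {"adj": {"better": "good", "best": "good", "bigger": "big"}}
RULES = {"adj": [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]}


def lemmatize(word: str, pos: str) -> str:
    """Try the index, then exceptions, then suffix rules validated by the index."""
    word = word.lower()
    index = INDEX.get(pos, set())
    exceptions = EXCEPTIONS.get(pos, {})

    # Already a known base form.
    if word in index:
        return word
    # Irregular form with a listed lemma.
    if word in exceptions:
        return exceptions[word]
    # Regular form: apply each rule and accept the first candidate in the index.
    for old_suffix, new_suffix in RULES.get(pos, []):
        if word.endswith(old_suffix):
            candidate = word[: len(word) - len(old_suffix)] + new_suffix
            if candidate in index:
                return candidate
    # Nothing matched; return the word unchanged.
    return word


if __name__ == "__main__":
    for w in ("wider", "older", "best", "faster"):
        print(w, "->", lemmatize(w, "adj"))
```

For "wider", the first rule yields "wid", which is not in the index, so the sketch moves on until the `("er", "e")` rule produces "wide". Checking candidates against the index is what keeps the rules from returning non-words in this toy version.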
14
votes
3 answers

Multilingual NLTK for POS Tagging and Lemmatizer

Recently I started working with NLP and tried to use NLTK and TextBlob for analyzing texts. I would like to develop an app that analyzes reviews written by travelers, so I have to handle a lot of texts written in different languages. I need to do two…
Alessio Schiavelli
  • 161
  • 1
  • 1
  • 6
11
votes
1 answer

WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

I'm lemmatizing the Ted Dataset Transcript. There's something strange I notice: not all words are being lemmatized. For example, selected -> select, which is right. However, involved !-> involve and horsing !-> horse unless I explicitly input the 'v'…
FlyingAura
  • 1,541
  • 5
  • 26
  • 41
11
votes
1 answer

Is there a good stemmer for Hebrew?

I am looking for a good stemmer for Hebrew - I found nothing at all using Google... On the HebMorph site it says that: Stem and Lemma originally have different meanings, but for Semitic languages they seem to be used interchangeably. Does that mean…
Cheshie
  • 2,777
  • 6
  • 32
  • 51