Questions tagged [stemming]

The process for reducing inflected words to their stem.

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.
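As a rough illustration of the idea, here is a minimal suffix-stripping sketch in Python. It is a toy, not the Porter or Snowball algorithm: the rule table `SUFFIX_RULES`, the function name `toy_stem`, and the minimum stem length of 3 are all assumptions made up for this example.

```python
# Toy suffix-stripping stemmer. Illustrative only: the rules below are
# hypothetical and far cruder than the real Porter/Snowball rule sets.
SUFFIX_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # communities -> community
    ("ing", ""),     # running -> runn (stems need not be real words)
    ("ed", ""),
    ("s", ""),       # cats -> cat
]

MIN_STEM_LENGTH = 3  # assumed threshold to avoid stripping very short words


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM_LENGTH:
            return word[:stem_length] + replacement
    return word


if __name__ == "__main__":
    for w in ["cats", "running", "ran", "community", "communities"]:
        print(w, "->", toy_stem(w))
```

Note that an irregular form such as "ran" passes through unchanged. Rule-based stripping has no way to relate it to "run", which is the kind of case where a dictionary-based lemmatizer is usually suggested instead.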

531 questions
114 votes, 22 answers

How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball, but neither works on all words; both miss some very common ones. My test words are: "cats running ran cactus cactuses cacti community communities", and both get fewer than half right. See also: Stemming…
manixrock
81 votes, 4 answers

Stemmers vs Lemmatizers

Natural Language Processing (NLP), especially for English, has evolved into the stage where stemming would become an archaic technology if "perfect" lemmatizers exist. It's because stemmers change the surface form of a word/token into some…
alvas
45 votes, 7 answers

What is the best stemming method in Python?

I tried all the NLTK methods for stemming, but they give me weird results with some words. Examples: they often cut the end of a word when they shouldn't (poodle => poodl, article => articl), or they don't stem very well: easily and easy are not stemmed in…
PeYoTlL
36 votes, 3 answers

Stemming algorithm that produces real words

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities. I've used an…
Dave
33 votes, 3 answers

Java library for keywords extraction from input text

I'm looking for a Java library to extract keywords from a block of text. The process should be as follows: stop word cleaning -> stemming -> searching for keywords based on English linguistics statistical information - meaning if a word appears more…
Shay
30 votes, 2 answers

Lucene Hebrew analyzer

Does anybody know whether one exists? I've been googling this for months... Thanks
Roey
29 votes, 7 answers

Stemming English words with Lucene

I'm processing some English texts in a Java application, and I need to stem them. For example, from the text "amenities/amenity" I need to get "amenit". The function looks like: String stemTerm(String term){ ... } I've found the Lucene Analyzer,…
Mulone
20 votes, 4 answers

User Warning: Your stop_words may be inconsistent with your preprocessing

I am following this document clustering tutorial. As input I give a txt file, which can be downloaded here. It is a combination of 3 other txt files separated by \n. After creating a tf-idf matrix I received this warning: UserWarning:…
20 votes, 4 answers

Tokenizer, Stop Word Removal, Stemming in Java

I am looking for a class or method that takes a long string of many hundreds of words, tokenizes it, removes the stop words, and stems the result for use in an IR system. For example: "The big fat cat, said 'your funniest guy i know' to the kangaroo..." the…
Phil
20 votes, 5 answers

Need a python module for stemming of text documents

I need a good Python module for stemming text documents in the pre-processing stage. I found this one, http://pypi.python.org/pypi/PyStemmer/1.0.1, but I cannot find the documentation in the link provided. If anyone knows where to find the…
Kai
16 votes, 2 answers

Import WordNet In NLTK

I want to import the WordNet dictionary, but when I import Dictionary from wordnet I see this error: for l in open(WNSEARCHDIR+'/lexnames').readlines(): IOError: [Errno 2] No such file or directory: 'C:\\Program Files\\WordNet\\2.0\\dict/lexnames' I…
Masoud Abasian
15 votes, 2 answers

nltk stemmer: string index out of range

I have a set of pickled text documents which I would like to stem using NLTK's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a Django app view. However, when stemming the documents inside the Django…
jkarimi
15 votes, 3 answers

Converting plural to singular in a text file with Python

I have txt files that look like this: word, 23 Words, 2 test, 1 tests, 4 And I want them to look like this: word, 23 word, 2 test, 1 test, 4 I want to be able to take a txt file in Python and convert plural words to singular. Here's my…
theintern
13 votes, 1 answer

WordListCorpusReader is not iterable

So, I am new to using Python and NLTK. I have a file called reviews.csv which consists of comments extracted from Amazon. I have tokenized the contents of this csv file and written them to a file called csvfile.csv. Here's the code: from…
Aarushi Aiyyar
12 votes, 4 answers

The reverse process of stemming

I use a Lucene Snowball analyzer to perform stemming. The results are not meaningful words. I referred to this question. One of the solutions is to use a database that contains a map from the stemmed version of the word to one stable version of…
CTsiddharth