Questions tagged [snowball]

Snowball is a small language for writing stemming algorithms, used primarily in information retrieval and natural language processing.

Created by Dr. Martin Porter, Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. It was created partially to provide a canonical implementation of Porter's stemming algorithm, and partially to facilitate the creation of stemmers for languages other than English.

A further aim of Porter's was to provide a way of creating and defining stemmers that could readily or automatically be translated into C, Java, or other programming languages. The Snowball compiler translates a Snowball script (a .sbl file) into either a thread-safe ANSI C program or a Java program. For ANSI C, each Snowball script produces a program file and corresponding header file (with .c and .h extensions).

The name "Snowball" is a tribute to the SNOBOL programming language.

73 questions
36
votes
3 answers

Stemming algorithm that produces real words

I need to take a paragraph of text and extract from it a list of "tags". Most of this is quite straightforward. However, I now need some help stemming the resulting word list to avoid duplicates. Example: Community / Communities I've used an…
Dave
  • 828
  • 1
  • 13
  • 18
25
votes
1 answer

Elasticsearch: How to list each analyzer used by a specific index

I need to find out which analyzer (type, language..) is configured in a specific index. I tried http://localhost:9200/wazzup/_mapping but it only gives information about field names/types. Thanks
Spadon_
  • 495
  • 2
  • 5
  • 11
23
votes
3 answers

Lucene Standard Analyzer vs Snowball

Just getting started with Lucene.Net. I indexed 100,000 rows using standard analyzer, ran some test queries, and noticed plural queries don't return results if the original term was singular. I understand snowball analyzer adds stemming support,…
alchemical
  • 13,559
  • 23
  • 83
  • 110
12
votes
2 answers

German Stemming for Sentiment Analysis in Python NLTK

I've recently begun working on a sentiment analysis project on German texts and I'm planning on using a stemmer to improve the results. NLTK comes with a German Snowball Stemmer and I've already tried to use it, but I'm unsure about the results.…
Florian
  • 155
  • 1
  • 9
12
votes
7 answers

Is there a Java implementation of the Porter2 stemmer?

Do you know of any Java implementation of the Porter2 stemmer (or any better stemmer written in Java)? I know that there is a Java version of Porter (not Porter2) here: http://tartarus.org/~martin/PorterStemmer/java.txt but on…
Bikash Gyawali
  • 969
  • 2
  • 15
  • 33
8
votes
1 answer

SnowballStemmer for Russian words list

I know how to run SnowballStemmer on a single word (in my case, a Russian one) by doing the following: from nltk.stem.snowball import SnowballStemmer stemmer = SnowballStemmer("russian") stemmer.stem("Василий") 'Васил' How can I do the…
Keithx
  • 2,994
  • 15
  • 42
  • 71
8
votes
1 answer

Italian stemming library in Java

I'm looking for a Java library, or something similar, that stems Italian words. The goal is to compare Italian words. At the moment, words like "attacco", "attacchi", "attaccare", etc. are considered different; instead I want returned a true…
Schiawo
  • 95
  • 7
5
votes
1 answer

Snowball Stemmer Usage

I'd like to use the stemmer here for merging word counts. http://snowball.tartarus.org/download.html The page has a download link, but I'm not sure how to integrate the files into my Eclipse project. It's not just a jar to drop into my lib folder, its…
LemonMan
  • 2,963
  • 8
  • 24
  • 34
5
votes
0 answers

Adding language to pystemmer

I would like to use pystemmer with whoosh, but there is no support for my language. I found two snowball files for my language (Snowball), and I made *.c files from them as advised here. Now I would like to include the *.c files in pystemmer. I added…
5
votes
5 answers

Use multiple stemming languages with ElasticSearch

I'm building a search engine for a website where users can be from many different countries and post text content. I'll assume that: - A French user generates content in French and English - A German user generates content in German and English etc... What…
Sebastien Lorber
  • 89,644
  • 67
  • 288
  • 419
4
votes
2 answers

Are Snowball & SnowballC packages different in R?

I am using stemDocument from the tm package in R to stem a text document. Example code: data("crude") crude[[1]] stemDocument(crude[[1]]) I get an error message: Error in loadNamespace(name) : there is no package called ‘Snowball’ I have…
Ram
  • 331
  • 1
  • 3
  • 11
3
votes
1 answer

Where to find Ukrainian 'ispell', 'aspell', 'snowball' dictionary for adding it to full-text search in Postgres?

After parsing many documents, I have a lot of rows/columns with Ukrainian text that should be indexed for full-text search in Postgres. I've found that Postgres 14 supports by default 29 languages, but unfortunately not the Ukrainian one. After…
3
votes
0 answers

Elasticsearch snowball in French not stemming correctly

I've seen a problem with the same stem word in French. Here is an example: snowball in French or curl -XDELETE http://localhost:9200/stacko36088193 curl -XPOST http://localhost:9200/stacko36088193 -d ' { "index": { "number_of_shards": 1, …
Roukmoute
  • 681
  • 1
  • 11
  • 26
3
votes
1 answer

How to Stem Shakespeare/KJV Using nltk.stem.snowball

I want to stem early modern English text: sb.stem("loveth") >>> "lov" Apparently, all I need to do is a small tweak to the Snowball Stemmer: And to put the endings into the English stemmer, the list ed edly ing ingly of Step 1b should be…
Joseph
  • 691
  • 1
  • 4
  • 12
3
votes
1 answer

Snowball Stemming: defining Regions

I'm trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, that are defined as follows: R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if…
HW90
  • 1,953
  • 2
  • 21
  • 45
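
The R1 definition quoted in the excerpt above can be sketched in a few lines of Python. This is only an illustration, not the Snowball project's own code: the function names `region_start` and `r1_r2` are hypothetical, the vowel set (a, e, i, o, u, y) is an assumption taken from the English stemmer, and R2 is assumed to apply the same rule inside R1. Real Snowball stemmers add language-specific exceptions that are ignored here.

```python
# Minimal sketch of the R1/R2 region rule (illustrative names, not a library API).
# Assumption: the vowel set is the one used by the English Snowball stemmer.
VOWELS = set("aeiouy")


def region_start(word, start=0, vowels=VOWELS):
    """Return the index just past the first non-vowel that follows a vowel,
    searching from `start`; return len(word) (the null region) if none exists."""
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def r1_r2(word, vowels=VOWELS):
    """Return the (R1, R2) substrings of `word`.
    Assumption: R2 is found by applying the same rule again inside R1."""
    r1 = region_start(word, 0, vowels)
    r2 = region_start(word, r1, vowels)
    return word[r1:], word[r2:]


if __name__ == "__main__":
    for w in ("beautiful", "beauty", "beau", "animadversion"):
        print(w, r1_r2(w))
```

For "beautiful" this gives R1 = "iful" and R2 = "ul", while "beau" has no non-vowel after a vowel, so both regions are empty.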