
I want to use a Lithuanian stemmer in Python, but Lithuanian is not supported by common tools such as NLTK.

I did find Snowball .sbl files for Lithuanian stemmers here and here.

But how to use them in Python?

All I was able to find is a command-line approach that compiles the .sbl files into .c files. But what comes next?

As stated on the official Snowball page, there is PyStemmer, a Python interface for Snowball. But I could not find any way to make it use new or custom .sbl algorithms.

So how do I get a new .sbl algorithm into Python?

Lukas

1 Answer


At the time of writing, Lithuanian has been added to the Snowball git repo, but PyStemmer uses an older version of that repo which doesn't contain it. I didn't manage to install the new version of Snowball in Python correctly, so instead I call the C executable through Python's subprocess module.

To do that, clone the repository and build it with the make command, which produces the stemwords executable. You can test the Lithuanian stemmer in a Unix terminal by running ./stemwords -l lt and then entering the words you would like to be processed.

To use it from Python's subprocess module on a file that contains the words to be stemmed, one per line:

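import subprocess

# Read words from input_file.txt (one per line) and write their stems to output_file.txt.
args = ("./stemwords", "-l", "lt", "-i", "input_file.txt", "-o", "output_file.txt")
# check=True raises CalledProcessError if stemwords exits with a non-zero status.
subprocess.run(args, check=True)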

input file:

Kodėl
moteriai
vienišai
ištekėjusiai

output file:

kod
mot
vieniš
ištekėjus
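
If the words are already in a Python list, the intermediate files can probably be skipped. The following is a minimal sketch, not a tested integration: it assumes stemwords reads from standard input and writes to standard output when -i and -o are omitted (as the interactive ./stemwords -l lt usage suggests) and that it works with UTF-8 text. The function name stem_words and its command parameter are made up for this example.

```python
import subprocess
from typing import List, Sequence


def stem_words(
    words: Sequence[str],
    command: Sequence[str] = ("./stemwords", "-l", "lt"),
) -> List[str]:
    """Pipe words, one per line, through a stemwords-style command.

    Assumption: the command reads words from stdin and prints one
    stem per line to stdout, both encoded as UTF-8.
    """
    if not words:
        return []
    result = subprocess.run(
        list(command),
        input="\n".join(words) + "\n",  # one word per line
        capture_output=True,            # collect stdout instead of a file
        encoding="utf-8",               # also switches the pipes to text mode
        check=True,                     # raise if the command fails
    )
    return result.stdout.splitlines()
```

With a built stemwords binary in the current directory, stem_words(["Kodėl", "moteriai", "vienišai", "ištekėjusiai"]) should return the same stems as the output file above. Starting one process per call is slow for single words, so it is better to pass the whole word list at once.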
Paulius Venclovas
  • How do I install it on Windows? Running `make` from GnuWin32 (https://sourceforge.net/projects/gnuwin32/) throws an error: ```The system cannot find the path specified. GNUmakefile:48: algorithms.mk: No such file or directory make: *** No rule to make target `algorithms.mk'. Stop.``` – Lukas Oct 20 '18 at 08:18