I'm trying to analyze Italian texts in R. As is usual in text analysis, I have removed all punctuation, special characters, and Italian stopwords. But I have a problem with stemming: there is only one Italian stemmer (Snowball), and it is not very precise.

To do the stemming I used the tm library, in particular the stemDocument function. I also tried the SnowballC library directly, and both lead to the same result.

  library(tm)
  stemDocument(content(myCorpus[[1]]), language = "italian")

The problem is that the resulting stems are not very precise. Are there other, more precise Italian stemmers? Or is there a way to extend the stemming already present in the tm library by adding new terms?

NelsonGon
  • Welcome to SO! I faced the same problem and wrote my own stemming function. It's not too complex, though I had quite simple texts to work with. – s__ Aug 21 '19 at 13:19
  • Thank you very much for your answer. I think this is the solution: create a stemming function. – Danny Paganin Aug 21 '19 at 13:34

1 Answer

Another alternative you can check out is the package from this person, who provides it for many different languages. Here is the link for Italian.

Whether it will help in your case is another matter, but it can also be used via the corpus package. Their documentation includes a sample (an English use case, so adapt it for Italian) in the Dictionary Stemmer section.


Alternatively, along the same lines, you can consider the stemmers or lemmatizers from Python libraries such as NLTK or spaCy and check whether you get better results (if you haven't considered lemmatizers, they are worth a look). After all, they are just files containing mappings from child words to root words. Download them, fine-tune the file to your requirements, and apply the mappings through a custom function.
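
The "mapping plus custom function" idea can be sketched in base R as follows. This is a minimal illustration, not the method of any particular package: the names `italian_overrides` and `make_dictionary_stemmer` are hypothetical, and the handful of word pairs is made up for the example rather than taken from a published dictionary.

```r
# Hypothetical override table: surface form -> preferred root.
# The entries are illustrative only; in practice this would be read
# from a downloaded (and hand-tuned) mapping file.
italian_overrides <- c(
  "vado"   = "andare",
  "andato" = "andare",
  "uomini" = "uomo"
)

# Build a stemmer that looks each word up in the table first and
# hands every unmatched word to a fallback function (identity by default).
make_dictionary_stemmer <- function(mapping, fallback = identity) {
  force(mapping)
  force(fallback)
  function(words) {
    words <- tolower(words)
    hit <- words %in% names(mapping)
    out <- character(length(words))
    out[hit] <- unname(mapping[words[hit]])
    # Only call the fallback when there is something left to stem.
    if (any(!hit)) {
      out[!hit] <- fallback(words[!hit])
    }
    out
  }
}

stem_it <- make_dictionary_stemmer(italian_overrides)
stem_it(c("Vado", "andato", "uomini", "casa"))
# returns c("andare", "andare", "uomo", "casa")
```

To keep Snowball for everything the table does not cover, one option is to pass `fallback = function(w) SnowballC::wordStem(w, language = "italian")`, assuming SnowballC is installed. The function works on a vector of single words, so documents need to be split into tokens before it is applied.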
NelsonGon
Ankur Sinha
  • About stemmers vs lemmatizers, [this](https://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers?rq=1) is quite interesting. – s__ Aug 21 '19 at 13:42
  • Yes, I've read that. In fact, during my thesis I tried both approaches, and I had more success with lemmatization. Having said that, there were cases where stemmers were sufficient. I guess it boils down to experimentation and the final needs. It's difficult to say one is better than the other. – Ankur Sinha Aug 21 '19 at 13:50