
By "non-English alphabet" I mean languages such as Urdu and Hindi. Can someone suggest an approach for lemmatizing them?

PS: Please do not mark this as a duplicate of "Lemmatization of non-English words?". The context here is different: I mean languages that do not use the English alphabet at all, whereas the other question is about non-English languages in general.

djokester
  • For Hindi, did you see http://stackoverflow.com/questions/4007558/is-there-is-any-stemmer-available-for-indian-language ? – fvu Mar 09 '17 at 16:01
  • @fvu a lemmatizer would have been better – djokester Mar 09 '17 at 16:10
  • Better, but a lot more complicated as well. There are some research papers on that topic floating around; start by reading those. – fvu Mar 09 '17 at 16:20

1 Answer


There is no difference between lemmatizing languages written in the Latin, Arabic, Devanagari or Cyrillic script. Unicode allows all of these scripts (and many others) to be represented and treated the same way, so as long as the writing system is based on pronunciation, the same technologies and algorithms can be used for lemmatization.

So technically there is no difference between your question and the question you linked to, "Lemmatization of non-English words?". Still, I'm not marking it as a duplicate, since your real question is "How do I lemmatize Hindi/Urdu?", and that is not answered there because Pattern does not support Hindi/Urdu.
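
To illustrate the point about scripts, here is a minimal Python sketch of a dictionary-based lemmatizer. It is not a real Hindi or Urdu lemmatizer: the four-entry lookup table and the names `normalize`, `LEMMAS` and `lemmatize` are made up for this example, and a usable tool would need a full lexicon or a morphological analyzer. The only thing it is meant to show is that the same code path handles Latin, Devanagari, Arabic-script and Cyrillic words once the text is Unicode. The NFC normalization step is my own addition, on the assumption that input may arrive in different but equivalent code point sequences.

```python
import unicodedata


def normalize(word: str) -> str:
    """Strip whitespace and bring the word to Unicode NFC form, so that
    canonically equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", word.strip())


# Toy lookup table, for illustration only (hypothetical data).
_RAW_LEMMAS = {
    "books": "book",      # English, Latin script
    "किताबें": "किताब",     # Hindi, Devanagari script
    "کتابیں": "کتاب",      # Urdu, Arabic script
    "книги": "книга",     # Russian, Cyrillic script
}

# Normalize keys and values once, so lookups only normalize the query.
LEMMAS = {normalize(k): normalize(v) for k, v in _RAW_LEMMAS.items()}


def lemmatize(word: str) -> str:
    """Return the lemma from the table, or the normalized word itself
    when the word is unknown. The script of the word plays no role."""
    key = normalize(word)
    return LEMMAS.get(key, key)


if __name__ == "__main__":
    for w in ["books", "किताबें", "کتابیں", "книги", "unknown"]:
        print(w, "->", lemmatize(w))
```

The lookup logic never inspects which script a word is written in; what differs per language is the data (the lexicon and the morphological rules), which is exactly the part that has to be found or built for Hindi and Urdu.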

alexis