4

Is there a modern part-of-speech tagger + dependency parser for the Russian language? I need a tool or service that can process plain text and output:

  • division into sentences
  • division into tokens
  • part-of-speech tags (fine-grained MSD tags are welcome)
  • lemmas (base forms)
  • dependency role labels

I need the tool for commercial purposes. It could be either an open-source project with a trained statistical model that can be used commercially (purchased if needed) or a web API. Failing that, it could be a proprietary closed-source binary with a proprietary model. The parsing models for Russian that I've found online all require TreeTagger, which 1) has a very unfriendly licence and 2) is over 20 years old.

adam.ra

1 Answer

1

To build a (good) dependency parser you need a dependency treebank. The teams that build dependency parsers have access to such treebanks, but they are usually not allowed to redistribute the data. As a result, you can get the parser itself, but usually not a pretrained model.

That is why you have to train a model yourself. For Russian there is a dependency treebank (SynTagRus). I don't know whether you will be able to obtain it for commercial purposes. Maybe these sites will help you:

https://github.com/UniversalDependencies/UD_Russian-SynTagRus
https://habrahabr.ru/post/148124/
http://www.ruscorpora.ru/index.html

If you manage to get the data, training your own model is a very easy task. Either ask here again, or look at one of the many guides on the internet (training a parser works much the same way whether it is for Russian or any other language).
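As a starting point, here is a minimal sketch of reading treebank data before training. It assumes the data is in CoNLL-U, the 10-column tab-separated format used by the Universal Dependencies repositories such as the one linked above. The `Token` class, the `read_conllu` function and the sample sentence are my own illustration (the sample is hand-written, not taken from SynTagRus), and only the Python standard library is used:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    """One syntactic word from a CoNLL-U file (hypothetical helper class)."""
    idx: int      # ID column: position in the sentence, starting at 1
    form: str     # FORM: surface form
    lemma: str    # LEMMA: base form
    upos: str     # UPOS: universal part-of-speech tag
    feats: str    # FEATS: morphological features, "_" if none
    head: int     # HEAD: index of the governing token, 0 for the root
    deprel: str   # DEPREL: dependency relation to the head


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; sentences end at blank lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        # Skip malformed rows, multiword ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        head = int(cols[6]) if cols[6].isdigit() else 0
        sentence.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], cols[5], head, cols[7])
        )
    if sentence:
        yield sentence


# Hand-written illustrative sentence, NOT copied from the treebank.
SAMPLE_ROWS = [
    ("1", "Мама", "мама", "NOUN", "_", "Case=Nom|Gender=Fem|Number=Sing", "2", "nsubj", "_", "_"),
    ("2", "мыла", "мыть", "VERB", "_", "Gender=Fem|Number=Sing|Tense=Past", "0", "root", "_", "_"),
    ("3", "раму", "рама", "NOUN", "_", "Case=Acc|Gender=Fem|Number=Sing", "2", "obj", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = (
    "# text = Мама мыла раму.\n"
    + "\n".join("\t".join(row) for row in SAMPLE_ROWS)
    + "\n\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE.splitlines()):
        for tok in sent:
            print(tok.idx, tok.form, tok.lemma, tok.upos, tok.head, tok.deprel)
```

To read a real file you would pass an open file object instead of `SAMPLE.splitlines()`, e.g. `read_conllu(open(path, encoding="utf-8"))`, where `path` points at whichever `.conllu` file you obtained. The columns cover everything you listed: sentence boundaries, tokens, POS tags with morphological features, lemmas and dependency labels.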

Volokh