
Is there a document somewhere that describes the Mecab algorithm?

Or could someone give a simple one-paragraph or one-page description?

I'm finding it too hard to understand the existing code, and what the databases contain.

I need this functionality in my free website and phone apps for teaching languages (www.jtlanguage.com). I also want to generalize it for other languages, and make use of the conjugation detection mechanism I've already implemented, and I also need it without license encumbrance. Therefore I want to create my own implementation (C#).

I already have a dictionary database derived from EDICT. What else is needed? A frequency-of-usage database?

Thank you.

jtsoftware

1 Answer


Some thoughts that are too long to fit in a comment.

§ What license encumbrances? MeCab is released under your choice of licenses, one of which is BSD, so that's about as unencumbered as you can get.

§ There's also a Java rewrite of MeCab called Kuromoji that's Apache-licensed, also very commercial-friendly.

§ MeCab implements a machine learning technique called conditional random fields (CRFs) for morphological parsing (separating free text into morphemes) and part-of-speech tagging (labeling those morphemes) of Japanese text. It can train on various dictionaries, the ones you've already seen: IPADIC, UniDic, etc. Those dictionaries are compilations of morphemes and parts of speech, and represent many human-years of linguistic research. The linked paper is by the authors of MeCab.
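To make that concrete: at parse time MeCab builds a lattice of every dictionary entry that can match at each position in the sentence, assigns each candidate a word cost and each adjacent pair a connection cost (both numbers come out of the CRF training against those dictionaries), and picks the cheapest path through the lattice. Here's a minimal sketch of that search in TypeScript; the dictionary lookup and the cost numbers are stand-ins, not MeCab's real data structures:

```typescript
// Sketch of MeCab-style lattice decoding: dictionary candidates at each
// position, word costs plus connection costs, cheapest path wins.
// `lookup` and `connectionCost` are placeholders for what a real dictionary
// (IPADIC/UniDic) and its CRF-trained connection matrix would supply.
interface Candidate { surface: string; pos: string; wordCost: number; }
type Lookup = (text: string, start: number) => Candidate[];

function connectionCost(leftPos: string, rightPos: string): number {
  return leftPos === rightPos ? 0 : 10; // placeholder for the POS-pair cost matrix
}

function segment(text: string, lookup: Lookup): Candidate[] {
  // best[i] = cheapest known way to reach character position i
  const best = Array.from({ length: text.length + 1 }, () => ({
    cost: Infinity, cand: null as Candidate | null, prev: -1,
  }));
  best[0].cost = 0;
  for (let i = 0; i < text.length; i++) {
    if (best[i].cost === Infinity) continue;
    const prevPos = best[i].cand ? best[i].cand!.pos : "BOS";
    for (const cand of lookup(text, i)) {
      const j = i + cand.surface.length;
      const cost = best[i].cost + cand.wordCost + connectionCost(prevPos, cand.pos);
      if (cost < best[j].cost) best[j] = { cost, cand, prev: i };
    }
  }
  // Walk back along the cheapest path to recover the segmentation.
  const out: Candidate[] = [];
  for (let i = text.length; i > 0 && best[i].cand; i = best[i].prev) out.unshift(best[i].cand!);
  return out;
}
```

The real thing keeps the minimum per lattice node (per dictionary entry ending at a position) rather than per character position, and handles unknown words with character-class templates, but the shape of the search is the same: the dictionaries supply the candidates and the costs, and the decoder just finds the cheapest path.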

§ Others have applied other powerful machine learning algorithms to the problem of Japanese parsing.

  • Kytea applies both support vector machines and logistic regression to the same problem. It's C++, Apache licensed, and the papers are there to read.
  • Rakuten MA is in JavaScript, also liberally licensed (Apache again), and comes with a regular dictionary plus a lightweight one for constrained apps; it won't give you readings of kanji, though. You can find the academic papers describing the algorithm there.

§ Given the above, I think you can see that simple dictionaries like EDICT and JMDICT are insufficient to do the advanced analysis that these morphological parsers do. And these algorithms are likely way overkill for other, easier-to-parse languages (i.e., languages with spaces).

If you need the power of these libraries, you're probably better off writing a microservice that runs one of these systems (I wrote a REST frontend to Kuromoji called clj-kuromoji-jmdictfurigana) instead of trying to reimplement them in C#.

Though note that it appears C# bindings to MeCab exist: see this answer.

In several small projects I just shell out to MeCab, then read and parse its output; see my TypeScript example using UniDic for Node.js.
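For a flavor of what that shell-out looks like, here's a bare-bones sketch in TypeScript for Node.js. It assumes a `mecab` binary is on your PATH; the number and meaning of the comma-separated feature columns depend on which dictionary you've installed:

```typescript
// Spawn mecab, feed it a sentence, and split its tab/comma-delimited output.
// Assumes the `mecab` binary is on PATH; feature columns vary by dictionary.
import { execFile } from "node:child_process";

interface Morpheme { surface: string; features: string[]; }

function parseMecab(sentence: string): Promise<Morpheme[]> {
  return new Promise((resolve, reject) => {
    const child = execFile("mecab", (err, stdout) => {
      if (err) return reject(err);
      const morphemes = stdout
        .split("\n")
        .filter(line => line && line !== "EOS") // "EOS" marks the end of each sentence
        .map(line => {
          const [surface, features = ""] = line.split("\t");
          return { surface, features: features.split(",") };
        });
      resolve(morphemes);
    });
    child.stdin!.end(sentence + "\n");
  });
}

// parseMecab("持っておられます").then(ms => console.log(ms));
```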

§ But maybe you don't need full morphological parsing and part-of-speech tagging? Have you ever used Rikaichamp, the Firefox add-on that uses JMDICT and other low-weight publicly-available resources to put glosses on website text? (A Chrome version also exists.) It uses a much simpler deinflector that quite frankly is awful compared to MeCab et al. but can often get the job done.
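To give a feel for that style of deinflection (closer to what you've already built than to MeCab): you keep a table of conjugated endings and what they rewrite to, peel endings off one rule at a time, and check every intermediate form against EDICT. A toy sketch follows; the rule table is a tiny made-up sample, nowhere near Rikaichamp's actual rule set:

```typescript
// Toy Rikaichamp-style deinflector: repeatedly strip a known conjugated ending,
// replace it with a dictionary-form ending, and test the result against EDICT.
// The rules below are an illustrative sample only, not a complete rule set.
const rules: { from: string; to: string }[] = [
  { from: "ます", to: "る" },       // 食べます -> 食べる
  { from: "って", to: "つ" },       // 持って  -> 持つ
  { from: "って", to: "る" },       // 取って  -> 取る (ambiguous: try every rule)
  { from: "られます", to: "られる" },
];

function deinflect(word: string, inDictionary: (w: string) => boolean): string[] {
  const results = new Set<string>();
  const queue = [word];
  while (queue.length) {
    const w = queue.shift()!;
    if (inDictionary(w)) results.add(w);
    for (const { from, to } of rules) {
      if (w.endsWith(from)) queue.push(w.slice(0, w.length - from.length) + to);
    }
  }
  return [...results];
}

// deinflect("持って", w => w === "持つ")  ->  ["持つ"]
```

The obvious limitation is ambiguity: the same ending can come from several dictionary forms, so you try every rule and let dictionary hits filter the candidates, with none of the context-sensitive weighting a lattice-based parser gives you.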

§ You had a question about the structure of the dictionaries (you called them "databases"). This note from Kimtaro (the author of Jisho.org) on how to add custom vocabulary to IPADIC may clarify at least how IPADIC works: https://gist.github.com/Kimtaro/ab137870ad4a385b2d79. Other more modern dictionaries (I tend to use UniDic) use different formats, which is why the output of MeCab differs depending on which dictionary you're using.
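For a rough idea of what one row of such a dictionary looks like (going from memory of the IPADIC CSV layout described in that gist, so treat the exact columns and numbers as approximate): each entry is a surface form, left/right context IDs, a cost, then part-of-speech fields, base form, and readings, something like:

```
渋谷,1288,1288,4569,名詞,固有名詞,地域,一般,*,*,渋谷,シブヤ,シブヤ
```

Lower cost means the analyzer prefers that entry, and the two context IDs index into a separate connection-cost matrix that scores adjacent morpheme pairs, which is exactly the weighting the path search consumes.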

Ahmed Fasih
  • Thanks, this helps a lot. Yeah, I was confused about the licensing. I was actually remembering the NMeCab SourceForge project, a native C# version of MeCab, which is GPLv2. Despite its power, I think MeCab could do a lot better. I don't like how it breaks up conjugated verbs. For example, 持っておられます is split up into 持っ-て-おら-れ-ます. In a post processor, I would have to try to remerge them, as I have code that would recognize "持って" and "おられます" and give inflection information, using just EDICT. I'm going to study it some more before I give up and use NMeCab. – jtsoftware May 09 '19 at 14:22
  • Update. Sorry, I unmarked this as an answer. I'm still hoping for a simple explanation of the algorithm. The paper, though explaining the problems well, is way too abstract in the actual solution, omitting any discussion of what's actually in the dictionaries and how they are used. It seems I could come up with a graph of all the valid paths through a sentence, tagging unknown character sequences as well, and then use some kind of weighting to pick the best path. Where the weighting comes from is my biggest question. Is it POS-based? I may need to crawl the code. – jtsoftware May 09 '19 at 15:14
  • Yes, morphemes aren't ideal for learning. Bunsetsu are better: take a look at [JDep.P](http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/#dl), a C++ post-processor to MeCab that chunks morphemes into bunsetsu, which one site defines as “a phonetic break consisting of a word and its postpositions or suffixes”. See an example of JDep.P output at https://gist.github.com/fasiha/fffa8914d25660859ad97da1eafd92cb. In my language learning software, I use MeCab with bunsetsu to identify conjugated phrases (verb phrases like 持っておられます (one bunsetsu with UniDic), and い-adjective phrases). – Ahmed Fasih May 10 '19 at 01:09
  • I am hopeful that someone can post reasonably-detailed pseudocode to allow us to implement MeCab's full algorithm but I think the chances are very slim. And even if they did, it'd take a long time to implement all the linear algebra, and test it as extensively as MeCab has been. Because of this, and because parsing only has to happen when text is initially authored, my apps make a networked API call to a microservice running MeCab+JDep.P (or Kuromoji as linked in the answer) and that returns JSON. I store that output in a database for all subsequent processing. – Ahmed Fasih May 10 '19 at 08:50