
I'm using fugashi to extract words from sentences. How do I add new terms that are not in the fugashi dictionary?

For example, ユーチューブ (YouTube) is split into ユー ("You") and チューブ ("Tube").

import fugashi

# Tokenize with the default system dictionary
tagger = fugashi.Tagger()
nodes = tagger.parseToNodeList("ユーチューブ")
# Keep only the surfaces of nouns (名詞)
goodpos = ['名詞']
nodes = [nn.surface for nn in nodes if nn.feature.pos1 in goodpos]

=> ['ユー', 'チューブ']

Penguin_.

1 Answer


I haven't gotten around to making a proper guide for this yet, but basically you should follow the MeCab docs on building a user dictionary, using fugashi-build-dict in place of mecab-dict-index.

To give brief instructions: first, make a CSV file that uses the same format as your system dictionary. The example below is based on unidic-lite.

令和,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
㋿,5969,5969,2588,補助記号,一般,*,*,*,*,,㋿,㋿,,㋿,,記号,*,*,*,*,*,*,*,,,,,*,*,*,*,999999
㋿,4786,4786,3992,名詞,固有名詞,一般,*,*,*,レイワ,令和,㋿,レーワ,㋿,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*
夢夢,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*

You can make this by copying entries from the UniDic source and editing fields. Then you run this command:

fugashi-build-dict -d dicdir/ -u mydic.dic mydic.csv

Here dicdir is the location of your system dictionary and mydic.csv is the CSV file you made. This creates the mydic.dic file, which you can then use with fugashi by specifying -u mydic.dic.
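As a rough sketch of that last step, the option string can be passed to `fugashi.Tagger` in the same way as other MeCab options. The helper names `user_dict_args` and `noun_surfaces` and the default path `mydic.dic` are assumptions made for illustration; whether ユーチューブ then comes out as a single token depends on the entry and cost in your user dictionary.

```python
def user_dict_args(user_dic_path):
    """Build the MeCab-style option string that points at a user dictionary."""
    return f"-u {user_dic_path}"


def noun_surfaces(text, user_dic_path="mydic.dic"):
    """Return the surfaces of all nouns in `text`, using the user dictionary."""
    import fugashi  # imported here so the helper above works without fugashi

    tagger = fugashi.Tagger(user_dict_args(user_dic_path))
    nodes = tagger.parseToNodeList(text)
    return [node.surface for node in nodes if node.feature.pos1 == "名詞"]


# Example call (requires fugashi, a system dictionary, and a built mydic.dic):
# print(noun_surfaces("ユーチューブ"))
```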

polm23
  • Is there a guide to the fields of unidic-lite, specifically what those three numbers in the beginning are? I think [this example](https://gist.github.com/Kimtaro/ab137870ad4a385b2d79) is for IPADIC and there, the first three numbers are `left_context_id` and `right_context_id` (ok to be -1) and `cost`. Any guidance on what numbers to use for ユーチューブ? – Ahmed Fasih May 17 '21 at 20:46
  • You are right about those first three numbers. They are the same in every MeCab dictionary; see the MeCab docs for details. For the cost, 100 is usually fine. For the other two, you need to find a similar term (same part of speech, etc.) in the dictionary you're using and copy its values. unidic-lite is based on UniDic 2.1.2 with accent annotations. – polm23 May 18 '21 at 04:32
  • Thank you for the answer. I have one more question. As you said, I ran the command `fugashi-build-dict -d mymecabdicdir/ -u mydic.dic csvfile.csv`. However, I get the error `dictionary.cpp(304) [ifs] no such file or directory: utf8`. Replacing `fugashi-build-dict` at the beginning of the command with `/usr/local/libexec/mecab/mecab-dict-index` works without any problems. I also set the fugashi version to 1.1.0. Is there a problem? – Penguin_. May 23 '21 at 11:53
  • Ah, looks like `fugashi-build-dict` had a bug and didn't work. I pushed a fix and will release soon. Also, in general, don't follow up on Stack Overflow in comments like this; it's hard to follow. Please just open an issue on GitHub. https://github.com/polm/fugashi – polm23 May 23 '21 at 13:30
  • This tweet has an explanation of the columns. https://twitter.com/zakki/status/920977351059554304 – coderfin Jul 21 '21 at 03:57
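
Pulling the comments together, one way to produce a row for ユーチューブ is to copy an existing entry with a similar part of speech, swap in the new surface and readings, and set the cost to 100. This is only a sketch: it assumes the 令和 proper-noun row from the answer is a suitable template (so its context IDs are reused), `make_entry` is a hypothetical helper, and the accent field `"1,0"` is left as copied and should be checked for your word.

```python
import csv
import io

# Template row copied from the answer above (unidic-lite format).
TEMPLATE = (
    '令和,4786,4786,8205,名詞,固有名詞,一般,*,*,*,レイワ,令和,令和,レーワ,令和,レーワ,'
    '固,*,*,*,*,*,*,*,レイワ,レイワ,レイワ,レイワ,"1,0",*,*,*,*'
)


def make_entry(template_line, replacements, cost=100):
    """Copy a dictionary row, swap field values via `replacements`, set the cost."""
    fields = next(csv.reader([template_line]))
    # Replace whole fields only; untouched fields (context IDs, POS) are kept.
    fields = [replacements.get(field, field) for field in fields]
    fields[3] = str(cost)  # the fourth column is the word cost
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(fields)
    return out.getvalue()


# Map the template's surface and readings to the new word (assumed values).
row = make_entry(
    TEMPLATE,
    {"令和": "ユーチューブ", "レイワ": "ユーチューブ", "レーワ": "ユーチューブ"},
)
print(row, end="")
```

The printed line can be appended to mydic.csv before running `fugashi-build-dict` again.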