10

In an application I work on, we use the Lucene Analyzer, especially its Hunspell part. The problem I face is: I need to generate all word forms of a word, using a set of affix rules.

E.g. given the word 'educate' and affix rules ABC, generate all forms of 'educate': educates, educated, educative, etc.

What I'd like to know is: is it possible to do this using Lucene's Hunspell implementation (we use a Hunspell dictionary (.dic) and affix file (.aff), so it has to be a Hunspell API)? Lucene's Hunspell API isn't that big; I went through it and didn't find anything suitable.

The nearest I could find on SO was this, but there are no answers related to Hunspell.

desertnaut
Haris Osmanagić

4 Answers

9

Hunspell comes with the unmunch command, which will create all word forms. You can call it like this:

 unmunch en_GB.dic en_GB.aff

Thus you might look in the hunspell source how this is implemented and whether it can be called from outside. The command was a bit buggy last time I checked when used on dictionaries with compounds - in those cases you cannot create all wordforms anyway, as there is an infinite number of them.

Daniel Naber
  • Thanks a lot for the answer, Daniel! I'm aware of unmunch. Calling it isn't an option for the use case where I want to add a new word, and I want to have a preview of all of its forms. I did try looking up in the source how it's implemented, but then I thought: if it's already implemented in Lucene, then I won't have to produce a new buggy port, and it will be consistent with other parts of Lucene. – Haris Osmanagić Dec 06 '12 at 10:21
  • @HarisOsmanagić, it seems to me that it is, on the contrary, what you were looking for. You would write your new word root with affix rules (e.g. `educate/ABC`) in a custom .dic file, then you would call `unmunch` with this custom dictionary and the standard affix file for your language. – Maëlan Mar 16 '21 at 14:54
6

I think what you're looking for is Hunspell's wordforms command:

Usage: wordforms [-s | -p] dictionary.aff dictionary.dic word
-s: print only suffixed forms
-p: print only prefixed forms

Example:

$ wordforms en_US.aff en_US.dic educate
educating
educated
educate
educates
educates

Read more in the documentation.
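
The sample output above lists `educates` twice, so the output of wordforms can evidently contain duplicates. A minimal shell sketch for removing them and saving the result to a file follows; the function name `unique_forms` and the output file name are made up for illustration, and only the standard `sort` utility is assumed.

```shell
# unique_forms: hypothetical helper, not part of Hunspell.
# Reads word forms on stdin and prints each distinct form once,
# sorted in byte order (LC_ALL=C) so the result does not depend on the locale.
unique_forms() {
    LC_ALL=C sort -u
}

# Intended use (assumes wordforms and the en_US files are available):
#   wordforms en_US.aff en_US.dic educate | unique_forms > educate_forms.txt
```

Sorting in the C locale is a design choice for reproducibility; drop `LC_ALL=C` if you prefer the collation order of your own locale.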

Pillowcase
  • do you happen to know if this is callable from Lucene? – Haris Osmanagić Mar 11 '15 at 07:49
  • 1
    how can i write the output to a file? – Furkan Gözükara Mar 02 '17 at 23:32
  • It doesn’t seem to work very well: you are missing "educative", "education", "educational", "educable", etc. and "educates" is repeated. – dardo82 Dec 07 '17 at 12:39
  • @dardo82 the results depend on the quality of the dictionaries supplied. Mine outputs [educating ,educated, educates, education, educate, educative] so not everything you list but better. – byoungb May 11 '18 at 14:22
  • @byoungb what dictionary are you using? – dardo82 May 11 '18 at 16:16
  • Well I am actually creating my own [link](https://github.com/publitek/stemming_dictionary)... I really use this just for stemming with search engines so that related words match the same results. But in that repo I have a link to the original dictionaries that I started with.. I could also send you the ones that I ran to get the output above... though not sure where they originated. – byoungb May 11 '18 at 16:54
  • `wordforms` only lists the permitted forms of a `word` that already exists in the reference `dictionary.{aff,dic}`. That is not how you generate forms for your own word roots, which I believe is what the original question was about. – Maëlan Mar 16 '21 at 14:58
4

(The original question was about generating all forms for one given word. This answer focuses on the harder problem of generating all forms for all words of a dictionary. I post this here as this is what comes up when searching for the harder problem.)

Update on unmunching

As of 2021, Hunspell provides two tools for generating word forms: unmunch and wordforms. Their respective usage is:

# print all forms for all words whose stems are given in `stems.dic`
# and make use of affix rules defined in `affixes.aff`:
unmunch   stems.dic affixes.aff
# print the forms of ONE given word (a single stem with no affix flag)
# which are allowed by the reference dictionary defined by the pair of
# `stems.dic` and `affixes.aff`:
wordforms affixes.aff stems.dic word

So affixes.aff would be given by your language, and stems.dic would be either a reference dictionary for your language, or a custom dictionary with the stems of the new words you want to generate.

Unfortunately, Hunspell’s unmunch is deprecated¹ and does not work properly. It is inherited from MySpell, and my guess is that it does not support all features of Hunspell. When I tried using it with the reference French dictionary (Dicollecte, v7.0), it generated garbage words by applying affix rules it was not supposed to apply (such as: conjugating non-verbs), and failed to generate many expected words. Some of the defects I could pinpoint:

  • apparently it does not properly support UTF-8?
  • it does not understand FLAG long, which leads to many affix rules being applied out of the blue;
  • it wrongly parses metadata attached to stems as affix flags, leading to even more arbitrary rules being applied;
  • it does not understand 0 as meaning the empty string and thus generates garbage words containing 0;
  • it seems to limit derivations to at most 2 rules, and thus misses many expected words.
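
Two of the defects above concern directives that can be spotted in the affix file before running unmunch. The following is a rough pre-check, assuming the `FLAG long` and `SET UTF-8` directives each sit at the start of a line; the function name `check_aff_for_unmunch` is hypothetical, and the check says nothing about the other defects.

```shell
# check_aff_for_unmunch: hypothetical heuristic, not part of Hunspell.
# Prints a warning for each .aff directive that the defects listed above
# suggest unmunch mishandles. Returns 1 if any was found, 0 otherwise.
check_aff_for_unmunch() {
    aff=$1
    status=0
    # "FLAG long" declares two-character affix flags.
    if grep -Eq '^FLAG[[:space:]]+long' "$aff"; then
        echo "warning: FLAG long declared"
        status=1
    fi
    # "SET UTF-8" declares the encoding of the dictionary files.
    if grep -Eq '^SET[[:space:]]+UTF-8' "$aff"; then
        echo "warning: SET UTF-8 declared"
        status=1
    fi
    return "$status"
}

# Intended use:
#   check_aff_for_unmunch affixes.aff || echo "unmunch output may be unreliable"
```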

wordforms should be more up-to-date, so you might try to emulate unmunch with wordforms (as the README suggests), but the latter only takes one unqualified stem and compares it against the whole dictionary implied by stems.dic and affixes.aff. This takes a lot of time per stem and, worse, you would have to call wordforms in turn for every stem in stems.dic, so the total running time is quadratic in the number of stems. For me, with the reference set of affixes for French, this is slow to the point of being unusable, even with only 10 stems! The unusable Bash code is, for illustration:

# /!\ EXTREMELY SLOW
aff='affixes.aff'
dic='stems.dic'
tail -n +2 "$dic" | while IFS= read -r stem ; do # read each entry, skipping the entry count on line 1
    stem="${stem%%/*}" # strip the optional slash and the affix flags that follow it
    wordforms "$aff" "$dic" "$stem" # generate all forms for this stem
done \
| sort -u # sort (according to the locale) and remove duplicates
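
To see which stems the loop would hand to wordforms, the extraction step can be run on its own. This sketch assumes the usual .dic layout (an entry count on the first line, then one entry per line) and that stems contain neither whitespace nor escaped slashes; `list_stems` is an illustrative name, not a Hunspell tool.

```shell
# list_stems: hypothetical helper, not part of Hunspell.
# Prints the bare stem of every entry in a .dic file.
list_stems() {
    # Skip the entry count on line 1, then drop anything after the first
    # whitespace (morphological fields), then anything from the first
    # slash (affix flags), and finally discard lines left empty.
    tail -n +2 "$1" | sed -e 's/[[:space:]].*$//' -e 's|/.*$||' -e '/^$/d'
}

# Intended use:
#   list_stems stems.dic | head
```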

Also, note that wordforms produces bare words, while unmunch was able to attach derived metadata (such as part-of-speech or gender), so with wordforms you lose information (which may or may not matter to you).

The lack of a replacement for unmunch is a known issue. Apparently the Hunspell developers will not address it in the foreseeable future (something about funding?). This has led several people to reimplement the functionality; you’ll find pointers throughout the GitHub issues.

  • In 2012 someone wrote an sh/awk script by adapting the source code of wordforms; maybe severely outdated, but I haven’t tried it.
  • In 2014 someone wrote another sh/awk script to treat a Hindi dictionary; for me it worked slightly better than the built-in unmunch, as it does support FLAG long, but still has the other defects mentioned above.
  • In December 2020 someone wrote a Perl module and a Perl program; looks great, but I’m not sure how to use them.
  • In 2021 I wrote my own version, which fixes the defects mentioned above and works for French. It does not support every feature, though; see the comments in the source code for a list of unsupported features.

¹ From the repo’s README.

Maëlan
0

To list all generated forms of a single word, assuming en_US.dic contains the entry word/abc, create a file containing:

1
word/abc

and save it as word.dic. Use:

unmunch word.dic en_US.aff

and you get all generated forms of word.
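
The two steps above can be scripted so that a new word can be previewed without editing files by hand. This is only a sketch: `make_single_stem_dic` is a hypothetical helper, and `educate/ABC` reuses the example stem and flags from the question.

```shell
# make_single_stem_dic: hypothetical helper, not part of Hunspell.
# Writes a one-entry .dic file.
#   $1 = stem with its affix flags, e.g. "educate/ABC"
#   $2 = path of the .dic file to create
make_single_stem_dic() {
    # A .dic file starts with the entry count, followed by one entry per line.
    printf '1\n%s\n' "$1" > "$2"
}

# Intended use (assumes unmunch is installed and en_US.aff is the affix
# file for your language):
#   make_single_stem_dic 'educate/ABC' preview.dic
#   unmunch preview.dic en_US.aff
```

Keep in mind the unmunch limitations described in the other answers when reading its output.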