
I would like to parse the Hunspell-format .aff and .dic files that OpenOffice supports.

English .aff and .dic files can be downloaded from here, for example: http://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice

I want to scan each line of a given .dic file and generate every possible word for that line using the rules in the accompanying .aff file.

How can I do that?

I have installed the NHunspell package, but it does not seem to have that feature: https://www.nuget.org/packages/NHunspell/

For example, for English, consider this entry:

make/UAGS

make can be make, made, makes, making, etc.

I need a parser that gives me all of these forms. How can I obtain them? Thank you very much.

So basically I want to scan each line of the dictionary and generate all possible words from the word on that line, and I don't know how to do that.

I could also write my own parser, but the rules seem fairly complex, and I have not found any detailed, easy-to-follow documentation for them.

Here is basically what I want: given analyze/ADSG and the en.dic and en.aff files, obtain all of the following words:

analyze, analyzes, analyzing, analyzed, reanalyze, reanalyzes, reanalyzing, reanalyzed

Furkan Gözükara
  • I don't know if that's viable without a third-party library. Even if you wrote your own parser, there would be a lot of exceptions. What do you need all forms of the word for? – johnny 5 Mar 02 '17 at 22:12
  • So I can have a static list of all words in that particular language and keep associations between words (e.g. makes is derived from make). That is necessary for my application. Applications like these must already be deriving all forms of the words in the dictionary in order to do what they do, so I believe there must be a way to do this – Furkan Gözükara Mar 02 '17 at 22:28
  • There is a standard way in which most languages convert words, e.g. Future Tense Make -> Present tense -> Making, Shake -> Shaking. You can create rules: for future -> present tense, if the word ends in e, drop the e and add ing. This will generally work for most words, but going from future tense to past differs a lot of the time: Make -> Made, Shake -> Shook, Run -> Ran. There may be rules that you can create, but there will still be a lot of exceptions. I think your best bet would be to look for a pre-existing DB of associated words, or for a third-party library that handles that – johnny 5 Mar 02 '17 at 22:46
  • @johnny5 I really need to solve this problem :( I have updated my question. I am pretty sure what I want is possible. Please check the updated question, thanks – Furkan Gözükara Mar 02 '17 at 23:08
  • I see, so all of the rules are already included and mapped to the dictionary. This is way easier to understand – johnny 5 Mar 02 '17 at 23:21
  • I don't see a function to translate this based on a rule but then again I'm on my phone. The parser doesn't look too hard to write. Just store a dictionary of string to list of prefix data, match a prefix to the applied regex – johnny 5 Mar 02 '17 at 23:30
  • @johnny5 I found the command :) It is wordforms. However, I don't know how to call it yet :( https://github.com/kris7t/hunspell/blob/master/src/tools/wordforms . I need to write the output to a file – Furkan Gözükara Mar 02 '17 at 23:33
  • @MonsterMMORPG Did you find a way to export all the words from the DIC and AFF files? I'm dealing with this problem too. – Macondo Jun 26 '20 at 17:12
  • maybe related: https://github.com/en-wl/wordlist/tree/master/agid – muescha Aug 01 '20 at 14:28

2 Answers


If you want the entire word list, you can run unmunch:

unmunch dictionary.dic dictionary.aff

Note that the current implementation of unmunch in Hunspell limits the number of words, the number of affixes, and the length of the generated words, so unmunch may fail if the target language exceeds those limits.

If you want just the list of possible words that can be generated from an entry, you may use wordforms:

wordforms dictionary.aff dictionary.dic word
Kartal Tabak

As Kartal Tabak pointed out, what you are looking for are the command-line tools wordforms and unmunch, which are distributed with Hunspell. However, wordforms handles only one stem at a time, and unmunch is very buggy. See this answer for alternatives.
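
One possible alternative is to apply the affix rules yourself. The following is a minimal Python sketch of that idea, not Hunspell's own algorithm. It assumes single-character flags and handles only plain PFX/SFX rules with cross products; it ignores continuation classes, compounding, NEEDAFFIX, and the other directives a real .aff file may use. SAMPLE_AFF is an illustrative fragment modeled on the English rules for flags A, D, G and S, and the names parse_aff, apply_rule and expand are made up for this example.

```python
import re

# Illustrative affix fragment (an assumption for this sketch, not a copy
# of any shipped en.aff file). Header lines: kind flag cross_product count.
# Rule lines: kind flag strip add condition.
SAMPLE_AFF = """\
PFX A Y 1
PFX A 0 re .
SFX D Y 4
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [^ey]
SFX D 0 ed [aeiou]y
SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]
"""


def parse_aff(text):
    """Return {flag: (kind, cross_product, [(strip, add, condition), ...])}."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] not in ("PFX", "SFX"):
            continue  # skip every directive this sketch does not understand
        kind, flag = parts[0], parts[1]
        if flag not in rules:
            # First line seen for a flag is its header line.
            rules[flag] = (kind, parts[2] == "Y", [])
            continue
        strip = "" if parts[2] == "0" else parts[2]
        add = parts[3].split("/")[0]  # drop continuation flags, unsupported here
        add = "" if add == "0" else add
        cond = parts[4] if len(parts) > 4 else "."
        rules[flag][2].append((strip, add, cond))
    return rules


def apply_rule(word, kind, strip, add, cond):
    """Apply one affix rule, or return None if the rule does not match."""
    if kind == "SFX":
        if not word.endswith(strip) or not re.search(f"(?:{cond})$", word):
            return None
        return word[: len(word) - len(strip)] + add
    if not word.startswith(strip) or not re.match(cond, word):
        return None
    return add + word[len(strip):]


def expand(entry, rules):
    """Expand a .dic entry such as 'analyze/ADSG' into its surface forms."""
    stem, _, flags = entry.partition("/")
    suffixed, prefix_rules = [], []
    for flag in flags:  # assumes one character per flag
        if flag not in rules:
            continue
        kind, cross, entries = rules[flag]
        for strip, add, cond in entries:
            if kind == "PFX":
                prefix_rules.append((strip, add, cond, cross))
            else:
                form = apply_rule(stem, kind, strip, add, cond)
                if form is not None:
                    suffixed.append((form, cross))
    forms = [stem] + [form for form, _ in suffixed]
    for strip, add, cond, pfx_cross in prefix_rules:
        # A prefix applies to the bare stem, and to suffixed forms only
        # when both the prefix and the suffix allow cross products.
        bases = [stem] + [f for f, sfx_cross in suffixed if pfx_cross and sfx_cross]
        for base in bases:
            form = apply_rule(base, "PFX", strip, add, cond)
            if form is not None:
                forms.append(form)
    return list(dict.fromkeys(forms))  # remove duplicates, keep order
```

With this fragment, expand("analyze/ADSG", parse_aff(SAMPLE_AFF)) yields the same eight forms listed in the question. To process a whole dictionary, you would read the real .aff file into parse_aff and call expand on each line of the .dic file after its first line, which holds the entry count.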

Furthermore, it seems that Hunspell does not expose this feature as library functions. If you want to use this feature programmatically (as you mentioned C# and NHunspell), then you probably need to spawn these external programs and parse their output.
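
Spawning the tools could look roughly like the following Python sketch (the same approach carries over to Process.Start in C#). It assumes that wordforms and unmunch are on the PATH, that they print one word per line, and that the dictionary is UTF-8 encoded; check the SET line of your .aff file and pass a different encoding if not. The helper names are made up for this example.

```python
import subprocess


def parse_word_list(output):
    """Split tool output into words, assuming one form per line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def run_wordforms(aff_path, dic_path, word, tool="wordforms", encoding="utf-8"):
    """Run `wordforms dictionary.aff dictionary.dic word` and return its words."""
    result = subprocess.run(
        [tool, aff_path, dic_path, word],
        capture_output=True,
        encoding=encoding,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return parse_word_list(result.stdout)


def run_unmunch(dic_path, aff_path, tool="unmunch", encoding="utf-8"):
    """Run `unmunch dictionary.dic dictionary.aff` and return its words."""
    result = subprocess.run(
        [tool, dic_path, aff_path],
        capture_output=True,
        encoding=encoding,
        check=True,
    )
    return parse_word_list(result.stdout)
```

To write the result to a file, as asked in the comments, join the returned list with newlines and write it out, or simply redirect the tool's standard output in the shell.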

Maëlan