Speed up use of WordNet lemmatizer for Java

Question

Another question is similar to this one, but it concerns a different programming language and seems to address a related but not identical problem. Is it possible to speed up the WordNet lemmatizer?

We are stemming a very large number of words in a text, and the code spends more than 90% of its run time on stemming alone, as can be seen in the picture.

[Image: profiling the analysis process]

When we read through the code and profiled it, it seemed that the WordNet library actually reads from a file every time it stems a word, and that this takes most of the execution time. Is there a way to improve performance by, say, backing the stemming process with a database instead of file reads, or by loading everything necessary into memory and bypassing the file? Or by adding some caching to the stemming process?
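To make the caching idea concrete, here is a minimal sketch of a bounded LRU cache placed in front of a slow lookup. The class name `CachingLemmatizer` and the `slowLookup` function are hypothetical, and the lambda in `main` is only a stand-in for the real dictionary call, not an actual lemmatizer. The sketch assumes the lookup never returns null and that it is used from a single thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Memoizes a slow word -> lemma lookup behind a bounded LRU cache. */
public class CachingLemmatizer {
    // Hypothetical: in real use this would wrap the WordNet lookup call.
    private final Function<String, String> slowLookup;
    private final Map<String, String> cache;
    private int misses = 0;

    public CachingLemmatizer(Function<String, String> slowLookup, final int maxEntries) {
        this.slowLookup = slowLookup;
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the cache is full.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached lemma, calling the slow lookup only on a miss. */
    public String lemma(String word) {
        String cached = cache.get(word);
        if (cached == null) {
            misses++;
            cached = slowLookup.apply(word);
            cache.put(word, cached);
        }
        return cached;
    }

    /** Number of times the slow lookup was actually invoked. */
    public int misses() {
        return misses;
    }

    public static void main(String[] args) {
        // Stand-in for the dictionary lookup: strips a trailing "s" and nothing more.
        CachingLemmatizer lemmatizer = new CachingLemmatizer(
                w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w, 10_000);
        for (String w : "dogs chase cats and dogs chase birds".split(" ")) {
            System.out.println(w + " -> " + lemmatizer.lemma(w));
        }
        System.out.println("slow lookups: " + lemmatizer.misses());
    }
}
```

Because natural-language text repeats a small set of words very often, even a modest cache size is likely to absorb most lookups, though the actual hit rate depends on the corpus.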

Are there any tools that would be easy to plug in to replace the line reading?

The profile of the line reading is shown here:

[Image: profile of the line-reading calls]

As you can see, file reading accounts for up to 62% of the total run time.

Can't you put the file in RAM, e.g., `/dev/shm` on Linux? How big is the file? The OS should cache it automatically, assuming you have enough RAM. — maaartinus, Jul 23 '14 at 09:26
It is only around 36 MB. The class is even called PrincetonRandomAccessDictionaryFile, so they are most likely reading it from memory. Yet it is slow. So much for the idea that fetching the file takes long. Is there something that could be done about the way it works? Or is it normal for read() and readLine() to take so long? I don't know how to determine whether they are doing the reading inefficiently. — Ev0oD, Jul 23 '14 at 16:29
It [looks like](http://grepcode.com/file/repo1.maven.org/maven2/net.sf.jwordnet/jwnl/1.4_rc3/net/didion/jwnl/dictionary/morph/LookupIndexWordOperation.java?av=f) there were multiple implementations, so you'd just need to select a memory based one. — maaartinus, Jul 23 '14 at 16:54
Thanks! Now I see that there is a MapBackedDictionary and a DatabaseBackedDictionary alternative. I am going to search how to use these now. — Ev0oD, Jul 23 '14 at 17:23
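
On the question raised in the comments about whether read() and readLine() are normally this slow: `java.io.RandomAccessFile` does no buffering of its own, and its `readLine()` pulls bytes one at a time through `read()`, so every byte costs a call into the operating system even when the file is already in the OS cache. Whether PrincetonRandomAccessDictionaryFile is built on `RandomAccessFile` is an assumption here, based only on its name and the profile. The sketch below compares the two reading styles on a temporary file; the class name `ReadLineComparison` and the sample line are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

public class ReadLineComparison {

    /** Counts lines with RandomAccessFile.readLine(), which reads byte by byte. */
    public static int countUnbuffered(Path file) throws IOException {
        int lines = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            while (raf.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    /** Counts lines with a BufferedReader, which reads large chunks at once. */
    public static int countBuffered(Path file) throws IOException {
        int lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.US_ASCII)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    /** Writes a temporary file with the given number of identical short lines. */
    public static Path writeSample(int lineCount) throws IOException {
        Path file = Files.createTempFile("readline-sample", ".txt");
        // Made-up line content; only the line count matters for the comparison.
        Files.write(file, Collections.nCopies(lineCount, "dog n 7 5 @ ~ #m #p %p"),
                StandardCharsets.US_ASCII);
        file.toFile().deleteOnExit();
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeSample(200_000);
        long t0 = System.nanoTime();
        int unbuffered = countUnbuffered(file);
        long t1 = System.nanoTime();
        int buffered = countBuffered(file);
        long t2 = System.nanoTime();
        System.out.printf("RandomAccessFile: %d lines in %d ms%n", unbuffered, (t1 - t0) / 1_000_000);
        System.out.printf("BufferedReader:   %d lines in %d ms%n", buffered, (t2 - t1) / 1_000_000);
    }
}
```

The timings vary by machine and JVM warm-up, so treat this as a rough check rather than a benchmark.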

score 1 · Accepted Answer · answered Jul 23 '14 at 20:38

One can use a MapBackedDictionary or a DatabaseBackedDictionary instead of a FileBackedDictionary.

Here is how I succeeded in running with MapBackedDictionary.

This requires the JWNL utilities. If you open the JWNL project sources, you can use the main method of its DictionaryToMap.java class to convert your existing dictionary folder to a map folder.

After that, create a map_properties.xml file similar to the file_properties.xml you used earlier for your FileBackedDictionary. This time the tags differ a bit. Here is my example XML, which worked well for me.

<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
<version publisher="Princeton" number="3.0" language="en"/>
<dictionary class="net.didion.jwnl.dictionary.MapBackedDictionary">
    <param name="morphological_processor" value="net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor">
        <param name="operations">
            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
            <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
                <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
                <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
                <param name="adjective" value="|er=|est=|er=e|est=e|"/>
                <param name="operations">
                    <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                </param>
            </param>
            <param value="net.didion.jwnl.dictionary.morph.TokenizerOperation">
                <param name="delimiters">
                    <param value=" "/>
                    <param value="-"/>
                </param>
                <param name="token_operations">
                    <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                    <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
                        <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
                        <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
                        <param name="adjective" value="|er=|est=|er=e|est=e|"/>
                        <param name="operations">
                            <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
                        </param>
                    </param>
                </param>
            </param>
        </param>
    </param>
    <param name="dictionary_element_factory" value="net.didion.jwnl.data.MapBackedDictionaryElementFactory"/>
    <param name="file_type" value="net.didion.jwnl.princeton.file.PrincetonObjectDictionaryFile"/>
    <param name="dictionary_path" value="path\to\wordnetMap\"/>
</dictionary>
<resource class="PrincetonResource"/>
</jwnl_properties>

Pay attention to the dictionary_path value (the path to wordnetMap): set it to the folder where you wrote the converted dictionary using the method mentioned earlier.

Don't forget to initialize JWNL with the new properties file. The MapBackedDictionary takes longer to load initially, but the performance boost afterwards is extreme.
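
The trade-off described above (a slower start in exchange for fast lookups) can be illustrated with a small sketch that reads a word list into a `HashMap` once and then answers every query from memory. This is not how MapBackedDictionary is implemented. The class name `PreloadedExceptions` is hypothetical, and the input is assumed to have one whitespace-separated "inflected base" pair per line, similar to WordNet's exception lists.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

/** Loads "inflected base" pairs into memory once, then answers lookups from a HashMap. */
public class PreloadedExceptions {
    private final Map<String, String> baseForms = new HashMap<>();

    /** Reads every line up front; this is the slow, one-time part. */
    public PreloadedExceptions(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length >= 2) {
                    // Keep only the first base form listed on the line.
                    baseForms.put(parts[0], parts[1]);
                }
            }
        }
    }

    /** Pure in-memory lookup; returns the word itself when nothing is listed for it. */
    public String baseForm(String word) {
        return baseForms.getOrDefault(word, word);
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample data; a real run would pass a FileReader over the actual file.
        String sample = "geese goose\nmice mouse\nchildren child\n";
        PreloadedExceptions exceptions = new PreloadedExceptions(new StringReader(sample));
        System.out.println(exceptions.baseForm("geese"));
        System.out.println(exceptions.baseForm("table"));
    }
}
```

After construction no further I/O happens, which is the property that makes a memory-backed dictionary attractive when the same data is queried very many times.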

This looks like a really terrible XML mess to me. I really wonder why they destroy the usability instead of simply prefetching the file (if any XML is needed, it should stay the same). — maaartinus, Jul 24 '14 at 02:11
Not sure what you mean. Anyway, I just copied some XML from the extJwnl library, since I could find practically no documentation on how the XML configuration should be set up. I found what had to be modified and modified it. I see no problem with XML files as long as they are documented, but this irritated me too. — Ev0oD, Jul 24 '14 at 07:32
