8

I am using the Python `pattern` library to get the singular form of English nouns:

    In [1]: from pattern.en import singularize
    In [2]: singularize('patterns')
    Out[2]: 'pattern'
    In [3]: singularize('gases')
    Out[3]: 'gase'

I am working around the problem in the second example by defining:

    def my_singularize(strn):
        '''
        Return the singular of a noun. Add special cases to correct
        pattern's generic rules.
        '''
        exceptionDict = {'gases': 'gas', 'spectra': 'spectrum',
                         'cross': 'cross', 'nuclei': 'nucleus'}
        try:
            return exceptionDict[strn]
        except KeyError:  # not a special case, defer to pattern
            return singularize(strn)
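
The same lookup can also be written without the `try`/`except`, using `dict.get` with the `pattern` result as the fallback (note that this variant always evaluates `singularize`, even on a dictionary hit):

    def my_singularize(strn):
        '''Return the singular of a noun, with special cases overriding pattern.'''
        exceptionDict = {'gases': 'gas', 'spectra': 'spectrum',
                         'cross': 'cross', 'nuclei': 'nucleus'}
        # Return the special case if present, otherwise defer to pattern.
        return exceptionDict.get(strn, singularize(strn))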

Is there a better way to do this, e.g. adding to pattern's rules, or making the `exceptionDict` somehow internal to pattern?

nikosd
  • How can you expect to catch all the exceptions in the English language (words like nuclei)? Are you using a finite number of words as your input, and you know all of them? You won't get anywhere trying to define all of the word exceptions, I can guarantee you. – Elias Benevedes May 10 '14 at 21:40
  • Yes, I wasn't thinking of catching all exceptions. However, my corpus is limited to scientific literature, which might make it easier. I guess the question is: does pattern already have a list of exceptions somewhere, so that I can add to that, instead of my own function? – nikosd May 10 '14 at 21:47
  • 2
    why not use something like a lemmatizer?? – shyamupa May 12 '14 at 05:39
  • @shyamupa: Thanks, I did not know what to look for, I guess. After a quick test the [nltk lemmatizer](http://stackoverflow.com/questions/771918/how-do-i-do-word-stemming-or-lemmatization) seems to work for most of my cases. I still need to check how much it slows things down, but I might be willing to live with this. – nikosd May 12 '14 at 18:29

1 Answer

5

As mentioned in the comments, you would be better off lemmatizing the words. A lemmatizer is part of NLTK's `stem` module:

    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    test_words = ['gases', 'spectrum', 'cross', 'nuclei']

    %timeit [wnl.lemmatize(wrd) for wrd in test_words]
    10000 loops, best of 3: 60.5 µs per loop

compared to your function:

    %timeit [my_singularize(wrd) for wrd in test_words]
    1000 loops, best of 3: 162 µs per loop

The NLTK lemmatizer performs better, at roughly 60 µs versus 162 µs for the four test words, and it returns the expected singulars for irregular plurals such as 'gases' and 'nuclei'.
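
One caveat worth noting: `WordNetLemmatizer` needs the WordNet corpus downloaded once, and `lemmatize` treats every word as a noun unless you pass a part-of-speech tag. A minimal setup sketch:

    import nltk
    nltk.download('wordnet')  # one-time download of the WordNet corpus

    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer()
    wnl.lemmatize('gases')         # 'gas' (pos defaults to 'n', i.e. noun)
    wnl.lemmatize('running', 'v')  # 'run' (pass a POS tag for non-nouns)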

heaven00