8

I am using the Python `pattern` library to get the singular form of English nouns:

    In [1]: from pattern.en import singularize
    In [2]: singularize('patterns')
    Out[2]: 'pattern'
    In [3]: singularize('gases')
    Out[3]: 'gase'

I am working around the problem in the second example by defining:

    def my_singularize(strn):
        '''
        Return the singular of a noun. Add special cases to correct
        pattern's generic rules.
        '''
        exceptionDict = {'gases': 'gas', 'spectra': 'spectrum',
                         'cross': 'cross', 'nuclei': 'nucleus'}
        try:
            return exceptionDict[strn]
        except KeyError:  # not a special case, defer to pattern
            return singularize(strn)
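
The same lookup can also be written without the `try`/`except`, using `dict.get` with the `pattern` result as the fallback (note that this variant always evaluates `singularize`, even on a dictionary hit):

    def my_singularize(strn):
        '''Return the singular of a noun, with special cases overriding pattern.'''
        exceptionDict = {'gases': 'gas', 'spectra': 'spectrum',
                         'cross': 'cross', 'nuclei': 'nucleus'}
        # Return the special case if present, otherwise defer to pattern.
        return exceptionDict.get(strn, singularize(strn))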

Is there a better way to do this, e.g. adding to pattern's rules, or making the `exceptionDict` somehow internal to pattern?

nikosd
  • How can you expect to catch all the exceptions in the English language (words like nuclei)? Are you using a finite number of words as your input, and you know all of them? You won't get anywhere trying to define all of the word exceptions, I can guarantee you. – Elias Benevedes May 10 '14 at 21:40
  • Yes, I wasn't thinking of catching all exceptions. However, my corpus is limited to scientific literature, which might make it easier. I guess the question is: does pattern already have a list of exceptions somewhere, so that I can add to that, instead of my own function? – nikosd May 10 '14 at 21:47
  • 2
    why not use something like a lemmatizer?? – shyamupa May 12 '14 at 05:39
  • @shyamupa: Thanks, I did not know what to look for, I guess. After a quick test the [nltk lemmatizer](http://stackoverflow.com/questions/771918/how-do-i-do-word-stemming-or-lemmatization) seems to work for most of my cases. I still need to check how much it slows things down, but I might be willing to live with this. – nikosd May 12 '14 at 18:29

1 Answer

5

As mentioned in the comments, you would be better off lemmatizing the words. A lemmatizer is part of NLTK's `stem` module:

    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    test_words = ['gases', 'spectrum', 'cross', 'nuclei']

    %timeit [wnl.lemmatize(wrd) for wrd in test_words]
    10000 loops, best of 3: 60.5 µs per loop

compared to your function:

    %timeit [my_singularize(wrd) for wrd in test_words]
    1000 loops, best of 3: 162 µs per loop

The NLTK lemmatizer performs better, at roughly 60 µs versus 162 µs for the four test words, and it returns the expected singulars for irregular plurals such as 'gases' and 'nuclei'.
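
One caveat worth noting: `WordNetLemmatizer` needs the WordNet corpus downloaded once, and `lemmatize` treats every word as a noun unless you pass a part-of-speech tag. A minimal setup sketch:

    import nltk
    nltk.download('wordnet')  # one-time download of the WordNet corpus

    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer()
    wnl.lemmatize('gases')         # 'gas' (pos defaults to 'n', i.e. noun)
    wnl.lemmatize('running', 'v')  # 'run' (pass a POS tag for non-nouns)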

heaven00