In German, job titles have a masculine and a feminine form. The feminine form is derived from the masculine one by adding the suffix "-in", which becomes "-innen" in the plural.
Example:
Form      | English           | German
----------+-------------------+------------------------
masc. sg. | teacher, doctor   | Lehrer, Arzt
fem. sg.  | teacher, doctor   | Lehrerin, Ärztin
masc. pl. | teachers, doctors | Lehrer, Ärzte
fem. pl.  | teachers, doctors | Lehrerinnen, Ärztinnen
Currently, I'm using NLTK's nltk.stem.snowball.GermanStemmer.
It returns these stems:
Lehrer -> lehr | Arzt -> arzt
Lehrerin -> lehrerin | Ärztin -> arztin
Lehrer -> lehr | Ärzte -> arzt
Lehrerinnen -> lehrerinn | Ärztinnen -> arztinn
Is there a way to make this stemmer return the same stems for all four versions, feminine and masculine ones? Alternatively, is there any other stemmer doing that?
Update
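I ended up adding "innen" and "in" as the first entries in the step 1 suffix tuple, like so:
from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()
# Prepend the feminine suffixes so step 1 tries them before the built-in ones.
stemmer._GermanStemmer__step1_suffixes = ("innen", "in") + stemmer._GermanStemmer__step1_suffixes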
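As far as I can tell, this assignment works because of Python's name mangling: a name with two leading underscores inside a class body is stored under the class-prefixed name, so what the class calls `__step1_suffixes` is reachable from outside as `_GermanStemmer__step1_suffixes`. Assigning to it on the instance shadows the class-level tuple for that one object only. Here is a toy sketch of the mechanism; `ToyStemmer` and its suffix list are made up for illustration and are not NLTK's code:

```python
class ToyStemmer:
    # Double leading underscore: stored on the class as _ToyStemmer__suffixes.
    __suffixes = ("er", "e")

    def stem(self, word):
        word = word.lower()
        # Only the first matching suffix is stripped, so tuple order matters.
        for suffix in self.__suffixes:
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word


patched = ToyStemmer()
# The instance attribute shadows the class attribute for this object only.
patched._ToyStemmer__suffixes = ("innen", "in") + patched._ToyStemmer__suffixes

untouched = ToyStemmer()

print(patched.stem("Lehrerinnen"))
print(untouched.stem("Lehrerinnen"))
```

Because the attribute is private, it may be renamed or restructured in a later NLTK release, so the patch is best treated as tied to the installed version.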
With the patched suffix tuple, all of the words above are stemmed to lehr and arzt, respectively. All other job titles that I have tried so far are also stemmed consistently, meaning the masculine and feminine forms share the same stem. And if the job title is derived from a verb, as with Lehrer/in, the forms share the verb's stem as well.
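To check more job titles without comparing stems by eye, a small helper can group the forms by the stem they produce. This is only a sketch: `group_by_stem` is a hypothetical helper, and `toy_stem` is a stand-in that merely strips "-innen" and "-in" so the example runs without NLTK.

```python
from collections import defaultdict


def group_by_stem(stem, words):
    """Map each stem to the list of input words that produce it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


def toy_stem(word):
    # Stand-in for illustration only; not the Snowball algorithm.
    word = word.lower()
    for suffix in ("innen", "in"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


groups = group_by_stem(toy_stem, ["Lehrer", "Lehrerin", "Lehrerinnen"])
print(groups)  # a single key means all forms share one stem
```

With NLTK installed, passing the patched stemmer's `stem` method in place of `toy_stem` should show whether all forms of a job title end up under one key.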