16

I want to know, in several languages, whether two words are:

  • either the same word,
  • or grammatical variants of the same word.

For example:

  • had and has have the same base: in both cases, it's the verb have,
  • city and cities have the same base,
  • went and gone have the same base.

Is there a way to use the Microsoft Word API to not just spell check text, but also normalize a word to a base or, at least, determine if two words have the same base?

If not, which libraries (free or paid, but not web services) would let me do this (again, in several languages)?

Arseni Mourzenko
  • When you say you want this done in several languages, do you mean that the words you're comparing can be different languages in a single comparison? For instance, should the English word 'city' be found as a match for the German word, 'Stadt'? Or do you expect that the two words you're comparing at least live in the same dictionary? – M.Babcock Jan 13 '12 at 19:59
  • @M.Babcock: I compare only English to English, German to German, so I need only one dictionary at a time. – Arseni Mourzenko Jan 13 '12 at 20:01
  • If it helps your search -- the usual term for that is "stemming" (see http://en.wikipedia.org/wiki/Stemming). – ruakh Jan 13 '12 at 22:11
  • Does "clearing" have the same base as "clearings" (noun = open spaces) or "clear/clears/cleared" (verb)? And is "clear" the same as "clearer/clearest", or is it an adverb with no related words? Without the context of how the word is being used in the sentence, and a (very large) dictionary of word forms and their relationships, the best approach is to use stemming and accept that it will have a small error rate. – Bradley Grainger Jan 13 '12 at 22:33
  • @BradleyGrainger: I would say that if one word is a noun and the other is a verb, the API should report them as two different words. But I would still probably accept an API that says those two words have the same base. – Arseni Mourzenko Jan 13 '12 at 22:57
  • @MainMa: But for the specific examples I posted, what are the "right" answers? What does `HasSameBase("clearing", "clear")` return? What about `HasSameBase("clearings", "cleared")` (or any other pairs of the words I listed)? My point is that natural language is very complex, and what you're asking for is difficult to implement. Is it possible to relax your requirements, so that using a stemming library (with its small rate of false negatives and false positives) is "good enough"? – Bradley Grainger Jan 21 '12 at 03:16
  • @BradleyGrainger: **I'm interested in both solutions**, even though, when I posted the original question, I expected `HasSameBase("clearing", "clear")` to return `false`, and `HasSameBase("clearings", "cleared")` to return `false` too. – Arseni Mourzenko Jan 24 '12 at 00:55

2 Answers

2

Inflector.NET is an open source library that you can use to normalize the inflection of English nouns. Available at: https://github.com/davidarkemp/Inflector/tree/master/Inflector
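To illustrate the general idea of rule-based noun normalization, here is a minimal Python sketch. It is not Inflector.NET's API or its rule set: the names `IRREGULAR_PLURALS`, `SINGULAR_RULES`, `singularize` and `same_noun` are hypothetical, and the rules cover only a handful of English patterns.

```python
import re

# Hypothetical table of irregular plurals; a real one would be far larger.
IRREGULAR_PLURALS = {"men": "man", "children": "child", "mice": "mouse"}

# (pattern, replacement) pairs, tried in order; the first match wins.
SINGULAR_RULES = [
    (re.compile(r"ies$"), "y"),               # cities -> city
    (re.compile(r"(ss|x|ch|sh)es$"), r"\1"),  # boxes -> box, classes -> class
    (re.compile(r"(?<!s)s$"), ""),            # cats -> cat, but keep "class"
]


def singularize(noun):
    """Return a guessed singular form of an English noun (lowercased)."""
    word = noun.lower()
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    for pattern, replacement in SINGULAR_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word


def same_noun(a, b):
    """True if both words reduce to the same guessed singular form."""
    return singularize(a) == singularize(b)


print(same_noun("city", "cities"))  # True
print(same_noun("went", "gone"))    # False: verb forms are out of scope
```

Extra edge cases would go into the irregular table or the rule list. Verb pairs such as went/gone are not recognized by this kind of noun-only approach.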

smartcaveman
  • (1) It seems to be available only for English. (2) Even for English, it will not work: although it handles one (city/cities) of the three examples I've given in my question, it fails for the other two, not counting all the edge cases that exist in English grammar. – Arseni Mourzenko Jan 13 '12 at 20:52
  • @MainMa, the class allows for including additional "edge cases". This class only works for **nouns** (I have updated my answer to reflect this). You are correct that this is English only, but you may be able to leverage the design patterns in a more-localized implementation. Good luck – smartcaveman Jan 13 '12 at 21:43
1

Snowball is a stemming API that can handle various natural languages, and Snowball implementations exist for various programming languages.

http://snowball.tartarus.org/
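A stem-based comparison along the lines of the `HasSameBase` function discussed in the comments could be sketched as follows in Python. The stemmer here is a deliberately crude stand-in, not the Snowball algorithm; `toy_stem` and `has_same_base` are hypothetical names, and a real Snowball binding would be passed in as the `stem` argument instead.

```python
def toy_stem(word):
    """Very rough suffix stripper; a placeholder for a real stemmer."""
    word = word.lower()
    rules = (("ies", "i"), ("ings", ""), ("ing", ""),
             ("ed", ""), ("s", ""), ("y", "i"))
    for suffix, replacement in rules:
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def has_same_base(a, b, stem=toy_stem):
    """True if both words reduce to the same stem under `stem`."""
    return stem(a) == stem(b)


# Assumption: with the third-party `snowballstemmer` package installed,
# snowballstemmer.stemmer("english").stemWord could be passed as `stem`.
print(has_same_base("city", "cities"))        # True
print(has_same_base("clearings", "cleared"))  # True: noun and verb conflated
print(has_same_base("went", "gone"))          # False: irregular forms missed
```

The last two calls show the trade-off raised in the comments: a stemmer ignores part of speech, and irregular pairs such as went/gone or had/has need a dictionary of word forms on top of it.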

Sprachprofi