Lemmatizer in R or python (am, are, is -> be?)

Question

I'm not a [computational] linguist, so please excuse my super dummy-ness in this topic.

According to Wikipedia, lemmatisation is defined as:

Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

Now my question is, is the lemmatised version of any member of the set {am, is, are} supposed to be "be"? If not, why not?

Second question: How do I get that in R or Python? I've tried methods like this link, but none of them gives "be" given "are". I'd guess that, at least for the purpose of classifying text documents, it makes sense for these forms to map to "be".

I also couldn't do that with any of the given demos here.

What am I doing/assuming wrong?

I don't understand why you think this question is too broad. The questions I'm asking are very specific, and the answer given here satisfies me. I wasn't sure whether the problem was with tools like the WordNet interface in R or whether I was inferring something wrong from the definition of lemmatization. - adrin, Apr 11 '14 at 09:34

jlhoward · Accepted Answer · 2014-04-10T19:14:56.027

So here is a way to do it in R, using the Northwestern University lemmatizer, MorphAdorner.

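library(httr)  # HTTP client: GET(), content()
library(XML)   # XML parsing: xmlParse(), xpathSApply()

lemmatize <- function(wordlist) {
  url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
  # Ask the MorphAdorner web service for the lemma of a single word.
  get.lemma <- function(word, url) {
    response <- GET(url, query = list(spelling = word, standardize = "",
                                      wordClass = "", wordClass2 = "",
                                      corpusConfig = "ncf",  # Nineteenth Century Fiction
                                      media = "xml"))
    body <- content(response, as = "text", encoding = "UTF-8")
    doc  <- xmlParse(body, asText = TRUE)
    # Return the text of the first <lemma> element in the reply.
    xpathSApply(doc, "//lemma", xmlValue)[[1]]
  }
  # sapply() keeps the input words as names on the result.
  sapply(wordlist, get.lemma, url = url)
}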

words <- c("is","am","was","are")
lemmatize(words)
#   is   am  was  are 
# "be" "be" "be" "be"

As I suspect you are aware, correct lemmatization requires knowing the word class (part of speech) and the contextually correct spelling, and it also depends on which corpus is being used.
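To make the part-of-speech point concrete, here is a minimal Python sketch. It is not how MorphAdorner or any real lemmatizer works: the `LEMMAS` table, the `lemmatize` helper, and the tag names "VERB" and "NOUN" are all made up for illustration. It only shows why the same spelling can map to different lemmas depending on the word class you supply.

```python
# Toy lookup table, invented for illustration only. A real lemmatizer uses
# a full lexicon plus morphological rules, not a handful of entries.
LEMMAS = {
    ("am", "VERB"): "be",
    ("is", "VERB"): "be",
    ("are", "VERB"): "be",
    ("was", "VERB"): "be",
    ("saw", "VERB"): "see",   # "I saw it"
    ("saw", "NOUN"): "saw",   # the cutting tool
    ("are", "NOUN"): "are",   # the metric unit of area (100 square metres)
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


words = ["am", "is", "are", "was"]
print([lemmatize(w, "VERB") for w in words])  # ['be', 'be', 'be', 'be']
print(lemmatize("are", "NOUN"))               # are
print(lemmatize("saw", "VERB"))               # see
print(lemmatize("saw", "NOUN"))               # saw
```

This is probably also why the tools you tried returned "are" unchanged. As far as I know, NLTK's `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to noun, so "are" is looked up as a noun unless you pass `pos="v"`.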

This does not work anymore. It fails with these parse errors: (1) Opening and ending tag mismatch: HR line 1 and body; (2) Opening and ending tag mismatch: HR line 1 and html; (3) Premature end of data in tag body line 1; (4) Premature end of data in tag html line 1. - Marcin, Apr 14 '17 at 11:07
