
I'm not a [computational] linguist, so please excuse my super dummy-ness on this topic.

According to Wikipedia, lemmatisation is defined as:

Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

Now my question is, is the lemmatised version of any member of the set {am, is, are} supposed to be "be"? If not, why not?

Second question: how do I get that in R or Python? I've tried methods like this link, but none of them gives "be" given "are". I'd guess that, at least for the purpose of classifying text documents, it makes sense for this to hold.

I also couldn't do that with any of the given demos here.

What am I doing/assuming wrong?

adrin
    I don't understand why you think this question is too broad: the questions I'm asking are very specific, and the answer given here satisfies me. I wasn't sure whether the problem is with tools like the Wordnet interface in R or whether I'm inferring something wrong from the definition of lemmatization. – adrin Apr 11 '14 at 09:34

1 Answer


So here is a way to do it in R, using the Northwestern University lemmatizer, MorphAdorner.

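library(httr)   # GET(), content()
library(XML)    # xmlInternalTreeParse(), xmlValue()

lemmatize <- function(wordlist) {
  # Query the MorphAdorner web service for the lemma of a single word
  get.lemma <- function(word, url) {
    response <- GET(url, query=list(spelling=word, standardize="",
                                    wordClass="", wordClass2="",
                                    corpusConfig="ncf",    # Nineteenth Century Fiction
                                    media="xml"))
    text <- content(response, as="text")            # raw XML body as a string
    xml  <- xmlInternalTreeParse(text, asText=TRUE)
    return(xmlValue(xml["//lemma"][[1]]))           # first <lemma> node
  }
  url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
  # One HTTP request per word; the result is named by the input words
  return(sapply(wordlist, get.lemma, url=url))
}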

words <- c("is","am","was","are")
lemmatize(words)
#   is   am  was  are 
# "be" "be" "be" "be" 

As I suspect you are aware, correct lemmatization requires knowledge of the word class (part of speech), contextually correct spelling, and also depends upon which corpus is being used.
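
To illustrate why the word class matters, here is a minimal offline sketch in base R. It is not how MorphAdorner works internally: `lemma.table` and `lookup.lemma` are hypothetical names, and the table entries are a hand-made toy lexicon assumed only for this example. It shows that the same spelling can map to different lemmas depending on part of speech, while every form of "to be" maps to "be".

```r
# Toy lexicon: (surface form, part of speech) -> lemma.
# These entries are illustrative assumptions, not data from a real corpus.
lemma.table <- data.frame(
  word  = c("is", "am", "are", "was", "saw", "saw", "meeting", "meeting"),
  pos   = c("verb", "verb", "verb", "verb", "verb", "noun", "verb", "noun"),
  lemma = c("be", "be", "be", "be", "see", "saw", "meet", "meeting"),
  stringsAsFactors = FALSE
)

# Look up each word together with its part of speech;
# words that are not in the table are returned unchanged.
lookup.lemma <- function(words, pos) {
  mapply(function(w, p) {
    hit <- lemma.table$lemma[lemma.table$word == tolower(w) &
                             lemma.table$pos == p]
    if (length(hit) > 0) hit[1] else w
  }, words, pos, USE.NAMES = FALSE)
}

lookup.lemma(c("are", "saw", "saw"), c("verb", "verb", "noun"))
# [1] "be"  "see" "saw"
```

A real lemmatizer would get the part of speech from a tagger run over the full sentence instead of taking it as an argument, which is where the contextual and corpus dependence comes in.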

jlhoward
  • This takes waaaaay tooo long – Marcin Jun 20 '16 at 21:52
  • This does not work anymore. Error: 1: Opening and ending tag mismatch: HR line 1 and body; 2: Opening and ending tag mismatch: HR line 1 and html; 3: Premature end of data in tag body line 1; 4: Premature end of data in tag html line 1 – Marcin Apr 14 '17 at 11:07