I'm having difficulty understanding how word stemming works in R.

In my example, I created the following corpus object:

a <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))

So a is

a[[1]]$content

[1] "device so much more funand  unlike most android torrent download clients"

The first word in this string is "device". I then created my term-document matrix

b <- TermDocumentMatrix(a, control = list(stemming = TRUE)) 

and got this as an output

dimnames(b)$Terms
[1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"      "much"     "torrent" 
[10] "unlik"

What I'd like to know is why I lost the "e" in "device" and "unlike" but did not lose it in "more".

How can I prevent this from happening for these words and for others?

Thanks.

Tomer
  • Read the documentation for the Porter stemmer. This is off-topic on SO: use [CrossValidated](http://stats.stackexchange.com/search?q=Porter+stemmer). Unless you actually want to write a custom stemmer, which is a different question. – smci Aug 26 '15 at 22:02

2 Answers

I'm assuming you are using the tm and SnowballC packages.

Stemming in these packages works using the Porter Stemming algorithm (in English).

If you want to play around with stemming algorithms, you can run:

getStemLanguages()

and try the others. The only other built-in English stemmer is:

wordStem(words, language = "english")

which, for your data, returns the same result:

 [1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"     "much"     "torrent" 
[10] "unlik" 
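
To see why "device" and "unlike" lose their "e" while "more" keeps it, it can help to stem just those words with both English stemmers. The sketch below assumes SnowballC is installed; the `words` vector is made up for illustration. As far as I understand the Porter rules, a final "e" is dropped when the remaining stem is long enough (as with "devic" and "unlik"), while a short stem ending in consonant-vowel-consonant, such as "mor", keeps it. Treat that as a reading of the algorithm, not as a statement about what tm does internally.

```r
# Sketch: compare the two English stemmers shipped with SnowballC
# on the words from the question.
library(SnowballC)

words <- c("device", "unlike", "more")  # illustrative input

# "porter" is the original Porter algorithm,
# "english" is the revised Snowball English stemmer.
porter_stems  <- wordStem(words, language = "porter")
english_stems <- wordStem(words, language = "english")

# Side-by-side view of each word and its stems.
print(data.frame(word = words, porter = porter_stems, english = english_stems))
```

Stems are not meant to be dictionary words; they only need to be identical for related forms, so "devic" is the expected result and not a bug.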
jeremycg

Another option is to use the MorphAdorner lemmatizer at Northwestern University. This answer has the code for the lemmatize(...) function.

library(tm)
a     <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))
words <- Terms(TermDocumentMatrix(a))
lemmatize(words)
#    android    clients     device   download     funand       more       most       much    torrent     unlike 
#  "android"   "client"   "device" "download"   "funand"     "more"     "most"     "much"  "torrent"   "unlike" 

As you can see, it removes the "s" from "clients" but not the "e" from "device".
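
If calling an external lemmatizer is not an option, one possible alternative is to stem first and then map each stem back to a full word with `stemCompletion()` from tm. This is only a sketch: it assumes tm and SnowballC are installed, uses the original words as the completion dictionary, and the variable names are made up for illustration. Note that completion returns a word from the dictionary, not a lemma, so a plural that appears only in plural form would stay plural.

```r
# Sketch: stem the terms, then complete each stem back to a readable word.
library(tm)
library(SnowballC)

words <- c("device", "unlike", "more", "clients")  # illustrative input
stems <- wordStem(words, language = "english")

# For each stem, pick the first dictionary entry that starts with it.
# The dictionary here is simply the vector of original words.
completed <- stemCompletion(stems, dictionary = words, type = "first")
print(completed)
```

With a larger corpus, the dictionary would typically be the unstemmed corpus itself, and `type = "prevalent"` would pick the most frequent completion.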

jlhoward