
I have 10 rows of text data in a CSV file. I want to correct various misspellings of a word, for example "battery" (misspelled as "battere", "batt", etc.). I considered using stemDocument followed by stemCompletion, and hence used the following code:

library(tm)
library(SnowballC)
text.var<-read.csv("C:\\Users\\Sambit\\Desktop\\Sample Data.csv",header=FALSE)
data_corp<-Corpus(VectorSource(text.var))

data_corp.copy<-data_corp
data_corp<-tm_map(data_corp, stemDocument)
data_corp<-tm_map(data_corp, stemCompletion, dictionary=data_corp.copy)

However, the last step, i.e. the stem completion step, throws the following error:

Error in setNames(if (length(n)) n else rep(NA, length(x)), x) : 
  'names' attribute [10] must be the same length as the vector [2]
In addition: Warning messages:
1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

Where did I possibly go wrong?

SamRoy
  • Try this workaround: http://stackoverflow.com/a/26696490/1036500 – Ben Nov 09 '14 at 21:34
  • It works mostly. Thanks for that. But in one case, "battery" is misspelled as "batteri", and its stem completion gives NA, whereas "batter" is correctly completed to "battery". Any suggestion? – SamRoy Nov 10 '14 at 04:37
  • Not really sure; it seems the misspelling is too different from the original word for the stemming to be useful. You might try a dictionary approach if you know all the misspelled words and what they should be (a sketch of that approach is below these comments). – Ben Nov 10 '14 at 06:03
  • Actually, the word "batteri" is remaining unchanged after the stemming step itself; that's why it is unaffected by stemCompletion. I wonder why :/ – SamRoy Nov 11 '14 at 04:56
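
One way to sketch the dictionary approach suggested in the comments (the fix_dict lookup table and fix_spelling helper below are illustrative assumptions, not part of the question; they assume the data_corp corpus built above):

library(tm)

# hypothetical lookup table: names are known misspellings, values are the corrections
fix_dict <- c(battere = "battery", batt = "battery", batteri = "battery")

# replace each listed misspelling with its correction inside every document
fix_spelling <- content_transformer(function(x, dict) {
  for (bad in names(dict)) {
    x <- gsub(paste0("\\b", bad, "\\b"), dict[[bad]], x)
  }
  x
})

data_corp <- tm_map(data_corp, fix_spelling, fix_dict)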

1 Answer


Stem completion is not working in version 3.1.2 of R. I wrote a similar function, but it is really slow.

manualStemCorpus <- function(x) {
  # collect every token occurring in the corpus
  words <- character(0)
  for (i in seq_along(x)) {
    words <- c(words, scan_tokenizer(content(x[[i]])))
  }

  words <- unique(sort(words))
  words.st <- stemDocument(words, language = 'english')

  # stem the documents themselves
  x <- tm_map(x, stemDocument, language = 'english')

  # replace each stem with the original word it came from
  for (i in seq_along(words)) {
    print(i)  # progress indicator; the loop is slow
    patt <- paste0('\\b', words.st[i], '\\b')
    for (j in seq_along(x)) {
      # note: fixed = TRUE would treat \b literally, so use a plain regex
      content(x[[j]]) <- gsub(patt, words[i], content(x[[j]]))
    }
  }

  return(x)
}
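
A minimal way to call it on the corpus from the question (assuming data_corp has been built as shown there) might be:

data_corp <- manualStemCorpus(data_corp)
inspect(data_corp)   # view the corrected documents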

You can check the result with:

numberOfWords <- function(x) {
  # count the unique tokens in a corpus
  a <- character(0)
  for (i in seq_along(x)) {
    a <- c(a, scan_tokenizer(content(x[[i]])))
  }
  a <- unique(sort(a))
  return(length(a))
}
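
For a rough before/after check (assuming data_corp.copy still holds the unmodified corpus from the question):

numberOfWords(data_corp.copy)   # unique tokens in the original corpus
numberOfWords(data_corp)        # should be smaller once spelling variants collapse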