I'm using the tm package in R. Everything works properly until I include stemCompletion, at which point I get the following error:

Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : 
  invalid regular expression 

My code is as follows:

path = '~/Interviews/Transcripts/'
file.names <- dir(path, pattern = '.txt')

corpus = lapply(seq_along(file.names), function(index) {
    fileName = file.names[index]
    filePath = paste(path, fileName, sep = '')
    transcript = readChar(filePath, file.info(filePath)$size)
    transcript <- gsub("[’‘^]", '', transcript)

    corpusName = paste('transcript', index, sep = "_")

    c <- Corpus(VectorSource(transcript))
    DublinCore(c[[1]], 'Identifier') <- paste(index, fileName, sep ='_')
    meta(c, type = 'corpus')

    c <- tm_map(c, stripWhitespace)
    c <- tm_map(c, content_transformer(tolower))
    c <- tm_map(c, removeWords, c(stopwords("english"), 'yeah', 'yep'))
    c <- tm_map(c, removePunctuation)
    c <- tm_map(c, stemDocument)
    c <- tm_map(c, stemCompletion, c)
    c <- tm_map(c, PlainTextDocument)
    c
})
user3603308
  • This is not reproducible. Good luck finding someone that will go dig into this. [Here are a few tricks](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how to make a good example. – Roman Luštrik May 16 '16 at 09:56
  • What do you expect `stemCompletion` to do? – lukeA May 16 '16 at 11:25

1 Answer

First, in theory you'd probably want to use tm_map(c, content_transformer(stemCompletion), c), because tm_map(c, stemCompletion, c) passes a PlainTextDocument to the argument x of stemCompletion, which expects a character vector (see ?stemCompletion). Second, there are no stemmed tokens to stem-complete, because you did not do any tokenization (see e.g. ?TermDocumentMatrix), and your dictionary corpus is already stemmed, so what you are trying might not work this way anyway.

(And third, I second @RomanLuštrik: please edit your post and make it a minimal reproducible example. That way, readers and others who hit this error can follow along easily.)

Here's an example:

content(tm_map(Corpus(VectorSource("stem completion has advantages")), stemDocument)[[1]])
# [1] "stem complet has advantag"

stemCompletion(c("complet", "advantag"), Corpus(VectorSource("stem completion has advantages")))
#      complet     advantag 
# "completion" "advantages"
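
Putting the two points together, here is a minimal sketch of one way the pipeline could be reordered: keep the unstemmed words as the dictionary, tokenize, stem the tokens, and only then complete them. The sample text and the variable names (raw, clean, tokens, dict, stems, completed) are made up for illustration, and calling SnowballC::wordStem directly instead of going through tm_map is my assumption, not something from the question.

```r
library(tm)  # provides stemCompletion(); SnowballC must be installed too

# Hypothetical stand-in for one transcript.
raw <- "Stem completion has advantages."

# Lower-case and strip punctuation first, so that no regex
# metacharacters end up in the stems passed to stemCompletion().
clean <- gsub("[[:punct:]]", "", tolower(raw))

# Tokenize on whitespace; stemCompletion() expects a character
# vector of individual stems, not a whole document.
tokens <- unlist(strsplit(clean, "[[:space:]]+"))

# The dictionary must hold the UNSTEMMED words.
dict <- unique(tokens)

# Stem each token, then complete the stems against the dictionary.
stems <- SnowballC::wordStem(tokens, language = "english")
completed <- stemCompletion(stems, dictionary = dict)
completed
```

The key design choice is that dict is built before stemming; if the dictionary is itself stemmed, as in the question, there is nothing longer for a stem to be completed to.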
lukeA