# init
libs <- c("tm", "plyr", "class", "RTextTools", "randomForest")
lapply(libs, require, character.only = TRUE)

# set options
options(stringsAsFactors = FALSE)

# set parameters
labels <- read.table('labels.txt')
path <- paste(getwd(), "/data", sep="")

# clean text
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, removeNumbers)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, stemDocument, language = "english")
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}

# build TDM
generateTDM <- function(label, path) {
  s.dir <- sprintf("%s/%s", path, label)
  s.cor <- Corpus(DirSource(directory = s.dir), readerControl = list(language = "en"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  return(list(name = label, tdm = s.tdm))
}

tdm <- lapply(labels, generateTDM, path = path)

# attach name
bindLabelToTDM <- function(tdm) {
  s.mat <- t(data.matrix(tdm[["tdm"]]))
  s.df <- as.data.frame(s.mat, stringsAsFactors = FALSE)
  s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)), row.names = NULL)
  colnames(s.df)[ncol(s.df)] <- "targetlabel"
  return(s.df) 
}

labelTDM <- lapply(tdm, bindLabelToTDM)

# stack
tdm.stack <- do.call(rbind.fill, labelTDM)
tdm.stack[is.na(tdm.stack)] <- 0

# hold-out
train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack) * 0.7))
test.idx <- seq_len(nrow(tdm.stack))[-train.idx]

tdm.lab <- tdm.stack[, "targetlabel"]
tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targetlabel"]

train <- tdm.stack[train.idx, ]
test <- tdm.stack[test.idx, ]

train$targetlabel <- as.factor(train$targetlabel)
label.rf <- randomForest(targetlabel ~ ., data = train, ntree = 5000, mtry = 15, importance = TRUE)

I am trying multi-class classification of text files using the randomForest algorithm. The error I get is probably caused by the last or second-to-last line.

Error in eval(expr, envir, enclos) : object '∗' not found

tdm.stack has one column per word found in the documents, and each cell holds that word's frequency in a document. The last column contains the class label.
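Because the formula interface treats column names as variables, one quick diagnostic for a frame of this shape is to list the column names that are not syntactically valid R names. This is a minimal sketch on a toy data frame; `toy` and its columns are hypothetical stand-ins, not the real data:

```r
# Toy stand-in for tdm.stack: term columns plus the class column (hypothetical data).
toy <- data.frame(
  model = c(2, 0, 1),
  `function` = c(1, 3, 0),
  `2d` = c(0, 1, 1),
  targetlabel = c("cs.AI", "cs.CV", "cs.AI"),
  check.names = FALSE  # keep the raw term names, as a term-document matrix would
)

# make.names() returns a syntactically valid version of each name,
# so any name it changes may not behave as a plain variable in a formula.
bad <- names(toy)[names(toy) != make.names(names(toy))]
print(bad)
```

Running the same check on `names(tdm.stack)` would show whether any terms (reserved words, names starting with digits, stray symbols) are candidates for this kind of error.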

I have tried everything and can't figure out the problem. Please help.

abhinav
  • could we take a look at your text file? – erasmortg Sep 30 '15 at 11:02
  • the text files are research articles from arxiv that i have converted from pdf to txt using xpdf. i am trying to classify them into their particular fields like cs.AI or cs.CV and so on. each article gets exactly one class, so this is multi-class classification, not multi-label, and my data set is also mapped to one class per txt. There are a total of 29 classes. – abhinav Sep 30 '15 at 11:05
  • Seems to me that you are trying to classify with english but are encountering some non-standard characters (names, perhaps). If so, maybe take a look at this? http://stackoverflow.com/questions/18153504/removing-non-english-text-from-corpus-in-r-using-tm – erasmortg Sep 30 '15 at 11:12
  • I added these two lines to my cleanCorpus function: `corpus.tmp <- tm_map(corpus.tmp, function(x) iconv(x, "latin1", "ASCII", sub=""))` and `corpus.tmp <- tm_map(corpus.tmp, PlainTextDocument)`. That error went away but now i get this error: `Error in model.frame.default(terms(reformulate(attributes(Terms)$term.labels)), : invalid type (special) for variable 'function'` – abhinav Sep 30 '15 at 11:34
  • so cleanCorpus does not throw any error message right now, but another function in your script does, correct? which one? at what point are you getting the error? – erasmortg Sep 30 '15 at 11:43
  • i am getting this error after the last line containing the randomForest() executes – abhinav Sep 30 '15 at 11:51
  • are `train` and `test` data.frames? try converting to those types and running the last lines – erasmortg Sep 30 '15 at 11:59
  • yes they are data frames. i still tried `train <- as.data.frame(train)` but it had no effect, as expected. – abhinav Sep 30 '15 at 12:01
  • going blind is a little difficult, could you post one text that replicates the behavior so it can be debugged by example? make sure to include the pertinent labels as well – erasmortg Sep 30 '15 at 12:04
  • my data directory has 29 subdirectories which are named after the classes. the training data are put in their respective class directories. Here is an example of a txt from cs.AI http://pastebin.com/69nSm1FY – abhinav Sep 30 '15 at 12:13
  • I've been trying with a text dir of my own, and got to the same point. I suggest asking a new question as your original problem was in fact solved in the comments – erasmortg Sep 30 '15 at 13:39
  • thanks for your help. i posted a new question http://stackoverflow.com/questions/32867905/randomforest-in-r-invalid-type-special-for-variable-function-error – abhinav Sep 30 '15 at 13:52

1 Answer


The error was caused by non-ASCII characters in my corpora. I added this line to my cleanCorpus function to remove them:

corpus.tmp <- tm_map(corpus.tmp, function(x) iconv(x, "latin1", "ASCII", sub=""))

This solved the problem.
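To see what that `iconv` call does in isolation, here is a minimal base-R sketch. The sample strings and the helper name `strip_non_ascii` are made up for illustration. In recent versions of tm, a custom function passed to `tm_map` generally needs to be wrapped in `content_transformer()`, as the question's code already does for `tolower`.

```r
# Hypothetical sample strings; the second and third contain non-ASCII characters.
docs <- c("plain ascii text", "na\u00efve caf\u00e9", "x \u2217 y")

# Same conversion as in the answer: any byte that cannot be
# represented in ASCII is replaced with the empty string (sub = "").
strip_non_ascii <- function(x) iconv(x, "latin1", "ASCII", sub = "")

cleaned <- strip_non_ascii(docs)
print(cleaned)

# Check that only ASCII bytes remain in every string.
only_ascii <- vapply(cleaned, function(s) all(as.integer(charToRaw(s)) < 128L), logical(1))
print(all(only_ascii))
```

Pure-ASCII input passes through unchanged, so the step is safe to apply to every document in the corpus.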

abhinav