train <- read.delim('train.tsv', header = TRUE, fileEncoding = "windows-1252", stringsAsFactors = FALSE)
train.tsv contains 156,060 lines of text in four columns: PhraseId, SentenceId, Phrase and Sentiment (on a scale of 0 to 4). The Phrase column holds the text. The tm package is already loaded. R version: 3.1.2; OS: Windows 7, 64-bit, 4 GB RAM.
> dput(head(train,6))
structure(list(PhraseId = 1:6, SentenceId = c(1L, 1L, 1L, 1L,
1L, 1L), Phrase = c("A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .",
"A series of escapades demonstrating the adage that what is good for the goose",
"A series", "A", "series", "of escapades demonstrating the adage that what is good for the goose"
), Sentiment = c(1L, 2L, 2L, 2L, 2L, 2L)), .Names = c("PhraseId",
"SentenceId", "Phrase", "Sentiment"), row.names = c(NA, 6L), class = "data.frame")
These are the first 6 rows of the train data frame.
clean_corpus <- function(corpus)
{
    mycorpus <- tm_map(corpus, removeWords, stopwords("english"))
    mycorpus <- tm_map(mycorpus, removeWords, c("movie", "actor", "actress"))
    mycorpus <- tm_map(mycorpus, stripWhitespace)
    mycorpus <- tm_map(mycorpus, tolower)
    mycorpus <- tm_map(mycorpus, removeNumbers)
    mycorpus <- tm_map(mycorpus, removePunctuation)
    mycorpus <- tm_map(mycorpus, PlainTextDocument)
    return(mycorpus)
}
# Build DTM
generateDTM <- function(df)
{
    m <- list(Sentiment = "Sentiment", Phrase = "Phrase")
    myReader <- readTabular(mapping = m)
    mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))
    # Code to attach sentiment label with every text line
    for (i in 1:length(mycorpus))
    {
        attr(mycorpus[[i]], "Sentiment") <- df$Sentiment[i]
    }
    mycorpus <- clean_corpus(mycorpus)
    dtm <- DocumentTermMatrix(mycorpus)
    return(dtm)
}
dtm1 <- generateDTM(train)
I have written two functions: one to clean the corpus and another to build the DTM (document-term matrix). I have also attached the sentiment value to each line of text. However, when I check dim(dtm1), it reports 156060 rows but 0 columns.
So, how can I generate a DTM with the sentiment labels attached?
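
For reference, here is a minimal sketch of the kind of result I am after, on a three-row toy subset. It rests on assumptions rather than anything verified on the full data: that plain tolower inside tm_map may be what breaks the documents in newer tm versions (so it is wrapped in content_transformer), that the corpus can be built from the Phrase column alone with VectorSource, and that keeping the labels as a parallel vector is acceptable. The names train_small, clean_corpus2 and generateDTM2 are hypothetical and exist only for this illustration.

```r
library(tm)

# Toy stand-in for train: rows 2, 3 and 5 of the dput output above
train_small <- data.frame(
  Phrase = c("A series of escapades demonstrating the adage that what is good for the goose",
             "A series",
             "series"),
  Sentiment = c(2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical variant of clean_corpus (assumption: tm >= 0.6)
clean_corpus2 <- function(corpus) {
  # Lower-case first so capitalised stop words are matched as well;
  # content_transformer() keeps every element a text document
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removeWords, c("movie", "actor", "actress"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

# Hypothetical variant of generateDTM: returns the DTM plus the labels
generateDTM2 <- function(df) {
  corpus <- VCorpus(VectorSource(df$Phrase))
  corpus <- clean_corpus2(corpus)
  dtm <- DocumentTermMatrix(corpus)
  # Row i of dtm corresponds to df$Sentiment[i]
  list(dtm = dtm, sentiment = df$Sentiment)
}

res <- generateDTM2(train_small)
dim(res$dtm)      # rows = documents, columns = terms
res$sentiment     # labels in the same order as the DTM rows
```

In this sketch the labels are not stored inside the corpus at all; they simply travel alongside the matrix in the same row order, which may or may not be enough depending on what the classifier expects.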