train <- read.delim('train.tsv', header= T, fileEncoding= "windows-1252",stringsAsFactors=F)

train.tsv contains 156,060 lines of text in 4 columns: PhraseId, SentenceId, Phrase and Sentiment (on a scale of 0 to 4). The Phrase column holds the text. The tm package is already loaded. R version: 3.1.2; OS: Windows 7, 64-bit, 4 GB RAM.

> dput(head(train,6)) 
structure(list(PhraseId = 1:6, SentenceId = c(1L, 1L, 1L, 1L, 
1L, 1L), Phrase = c("A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .", 
"A series of escapades demonstrating the adage that what is good for the goose", 
"A series", "A", "series", "of escapades demonstrating the adage that what is good for the goose"
), Sentiment = c(1L, 2L, 2L, 2L, 2L, 2L)), .Names = c("PhraseId", 
"SentenceId", "Phrase", "Sentiment"), row.names = c(NA, 6L), class = "data.frame")

These are the top 6 rows of the train data frame.

clean_corpus <- function(corpus)
  {
   mycorpus <- tm_map(corpus, removeWords,stopwords("english"))  
   mycorpus <- tm_map(mycorpus, removeWords,c("movie","actor","actress"))  
   mycorpus <- tm_map(mycorpus, stripWhitespace)  
   mycorpus <- tm_map(mycorpus, tolower)  
   mycorpus <- tm_map(mycorpus, removeNumbers)
   mycorpus <- tm_map(mycorpus, removePunctuation)
   mycorpus <- tm_map(mycorpus, PlainTextDocument ) 
   return(mycorpus) 
}

# Build DTM
generateDTM <- function(df)
{
   m <- list(Sentiment = "Sentiment", Phrase = "Phrase")
   myReader <- readTabular(mapping = m)
   mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))

   # Attach the sentiment label to every text line
   for (i in 1:length(mycorpus))
   {
      attr(mycorpus[[i]], "Sentiment") <- df$Sentiment[i]
   }
   mycorpus <- clean_corpus(mycorpus)
   dtm <- DocumentTermMatrix(mycorpus)
   return(dtm)
}

dtm1 <- generateDTM(train) 

I have written two functions: one to clean the corpus and one to build the DTM (Document Term Matrix). I have also tried to link each sentiment value to its line of text. When I check the dimensions of dtm1, it shows 156060 rows but 0 columns.

So, how can I generate a DTM with sentiment labels attached?

Ken Benoit
  • It would help if you could create a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). If we don't have sample input data, we can't run your code to see where the error is coming from. – MrFlick Jan 31 '15 at 08:17
  • @MrFlick I hope the edited version will help. If you want the full training data, please let me know. – Avneesh047 Jan 31 '15 at 11:37

1 Answer

When you set up your reader, you need to map something to the "content" of the document; otherwise it doesn't know what text to use to build the corpus, which is why the DTM ends up with 0 columns. Other values are stored as metadata. Try changing the code to

m <- list(Sentiment = "Sentiment", content = "Phrase")
myReader <- readTabular(mapping = m)
mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))
MrFlick
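
For comparison, here is a minimal sketch of a different route that skips the custom reader entirely. It is not the readTabular approach from the answer: it builds the corpus from the Phrase column with VectorSource and keeps the sentiment labels in a separate vector aligned with the DTM rows. The names train_small, labels and labelled are made up for illustration, and the sketch assumes tm 0.6 or later, where content_transformer is available.

```r
library(tm)

# Toy stand-in for train.tsv, using three phrases from the dput() output above
train_small <- data.frame(
  Phrase = c(
    "A series of escapades demonstrating the adage that what is good for the goose",
    "A series",
    "of escapades demonstrating the adage that what is good for the goose"
  ),
  Sentiment = c(2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# VectorSource treats each element of the character vector as one document
corpus <- VCorpus(VectorSource(train_small$Phrase))

# Lowercase first so that capitalised stopwords are also removed;
# content_transformer() makes tm_map() return documents, not bare character vectors
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
print(dim(dtm))

# DTM rows follow the order of the input rows, so the labels can stay in a parallel vector
labels <- train_small$Sentiment
stopifnot(length(labels) == nrow(dtm))

# For a small sample, the labels and term counts can be combined into one data frame
labelled <- data.frame(Sentiment = labels, as.matrix(dtm))
```

On the full 156,060 rows, as.matrix(dtm) may need more memory than a 4 GB machine has, so keeping labels as a separate vector next to the sparse dtm is likely the safer choice there.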