I used the tm package in R for text mining. This is what my code looks like:

library(tm)

Load the data in R

pathToData = "R/group_data"
newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
                    readerControl = list(reader = readPlain))

Length of news corpus

length(newsCorpus)

Pre-processing the corpus data

newsCorpus = tm_map(newsCorpus,removePunctuation)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus,removeNumbers)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, stripWhitespace)
newsCorpus[["103806"]]

Corpus elements to plain text

newsCorpus = Corpus(VectorSource(newsCorpus))

Document Term matrix with TFIDF weights

docTermMatrix = DocumentTermMatrix(newsCorpus, 
                               control = list(weighting = weightTfIdf, 
                                              minWordLength = 1,
                                              minDocFreq = 1))

Dimensions of resulting matrix

dim(docTermMatrix)

The docTermMatrix looks like this:

<<DocumentTermMatrix (documents: 1986, terms: 22213)>>
 Non-/sparse entries: 173995/43941023
 Sparsity           : 100%
 Maximal term length: 163
 Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

Now I want to inspect the docTermMatrix for the document "101287" and look up the terms "textmining" and "clustering". But the document term matrix has changed the document names (the rows) to 1, 2, 3, 4, ..., so I can no longer find the document named "101287" and look at the columns "textmining" and "clustering". Is there a way to preserve the document names? Apologies if I am missing something.

Output from R for the above code

> library(tm)
  > pathToData = "R/group_data"
  > newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE), 
              readerControl = list(reader = readPlain))

 > length(newsCorpus)
    [1] 1986

 > newsCorpus[["103806"]]
  <<PlainTextDocument (metadata: 7)>>
  From: cheekeen@tartarus.uwa.edu.au (Desmond Chan)
  Subject: Re: Honda clutch chatter
  Organization: The University of Western Australia
  Lines: 8
  NNTP-Posting-Host: tartarus.uwa.edu.au
  X-Newsreader: NN version 6.4.19 #1

  I also experience this kinda problem in my 89 BMW 318. During cold
  start ups, the clutch seems to be sticky and everytime i drive out, for
  about 5km, the clutch seems to stick onto somewhere that if i depress
  the clutch, the whole chassis moves along. But after preheating, it
  becomes smooth again. I think that your suggestion of being some
  humudity is right but there should be some remedy. I also found out that
  my clutch is already thin but still alright for a couple grand more!

 > newsCorpus = tm_map(newsCorpus,removePunctuation)
 > newsCorpus = tm_map(newsCorpus,removeNumbers) 
 > newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
 > newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english")) 
 > newsCorpus = tm_map(newsCorpus, stripWhitespace)

 > newsCorpus = Corpus(VectorSource(newsCorpus)) 

 > docTermMatrix = DocumentTermMatrix(newsCorpus, control = list(weighting = weightTfIdf, minWordLength = 1, minDocFreq = 1))

 > dim(docTermMatrix)
 [1]  1986 22213



> inspect(docTermMatrix["1","bmw"])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 0/1
Sparsity           : 100%
Maximal term length: 3
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

    Terms
Docs bmw
  1   0

> inspect(docTermMatrix["103806", "bmw"])
Error in `[.simple_triplet_matrix`(docTermMatrix, "103806", "bmw") : 
Subscript out of bounds.
Deeksha
  • The document term matrix is now a matrix. Did you try `inspect(dtm["101287","textmining"])` to look at the values? You need to use proper row/column indexing. – MrFlick Dec 08 '14 at 20:16
  • Yes, I already tried that and it gives me this error: Error in `[.simple_triplet_matrix`(docTermMatrix, "101287", "textmining") : Subscript out of bounds. I can only run this command on docTermMatrix now : inspect(docTermMatrix[1,1:10]) and it gives me this kind of result : Docs aaa aaaaa aaaah aaareadmetxt aafreenetcarletonca 1 0 0 0 0 0 – Deeksha Dec 08 '14 at 20:35
  • Well, I was unable to [reproduce](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) your problem based on your description. It worked for me (tm_0.6). – MrFlick Dec 08 '14 at 20:38
  • So if I am understanding it correctly are you able to do inspect(dtm["101287", "textmining"]) and get some result. Well I understand the data might be different. But you are able to access dtm using this notation ? – Deeksha Dec 08 '14 at 20:41
  • Okay, so "101287" was the name of my document in the corpus. After conversion to a DocumentTermMatrix, dim(dtm) = 1986 22213, so it has 1986 rows and 22213 columns. Now when I search for the document name "101287", it does not exist. So I am trying to figure out how to access the document named "101287" when I don't know which row it occupies in the DocumentTermMatrix. The name of the document is the only information I have. – Deeksha Dec 08 '14 at 20:59
  • The indexes are independent from the names so that shouldn't matter. You should focus on making your problem [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we can run the exact same code as you to see why it might not be working for you. – MrFlick Dec 08 '14 at 21:01
  • I have added my R output . I hope this might help in catching some mistake that I am repeatedly making. I have two inspect methods and the second one still gives me the same error. – Deeksha Dec 08 '14 at 21:54
  • BTW, I figured out one thing from the code: if I do not convert the corpus to plain text using newsCorpus = Corpus(VectorSource(newsCorpus)), I can access the document in the matrix by its document name ("103806"). But if I convert it to plain text, I cannot. – Deeksha Dec 09 '14 at 19:27

1 Answer

You essentially want to encode each document's ID in the document term matrix. You can do that by saving it as an attribute in your text corpus. Check out this more detailed answer.
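
A minimal sketch of that idea, assuming a recent tm release (the behaviour of `meta()` and of `Corpus(VectorSource(...))` has changed between versions, so treat this as illustrative rather than definitive). The two short texts and the `texts` vector are made up for the example; only the document names are taken from the question. Each document's name is stored in its "id" metadata, the corpus is not re-wrapped in `VectorSource` after preprocessing, and the matrix rows can then be looked up by that ID:

```r
library(tm)

# Hypothetical stand-in for the news files: names are the document IDs.
texts <- c("103806" = "My BMW clutch sticks during cold starts.",
           "101287" = "Text mining and clustering of news posts.")

corpus <- VCorpus(VectorSource(unname(texts)))

# Save each document's original name as its "id" metadata.
for (i in seq_along(corpus)) {
  meta(corpus[[i]], "id") <- names(texts)[i]
}

# Preprocess with tm_map only; do not rebuild the corpus afterwards,
# since rebuilding from a VectorSource assigns fresh IDs.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

dtm <- DocumentTermMatrix(corpus)

# The row names of the matrix should now be the stored IDs.
Docs(dtm)
inspect(dtm["103806", "bmw"])
```

In the question's code, the equivalent change would be to drop the `newsCorpus = Corpus(VectorSource(newsCorpus))` step, which matches the asker's own observation in the comments that the lookup by name works without it.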

Kasia Kulma