
I'm new to R. I'm mining data from a CSV file: summaries of reports in one column, the date of each report in another column, and the report's agency in the third column. I need to investigate how terms associated with 'fraud' have changed over time or vary by agency. I've filtered the rows containing the term 'fraud' and created a new CSV file.

How can I create a term frequency matrix with years as rows and terms as columns, so that I can look for the top-frequency terms and do some clustering?

Basically, I need to create a term frequency matrix of terms against years.

Input data: (csv)
**Year**    **Summary** (around 300 words each)    
1945             <text>
1985             <text>
2011             <text>

Desired output: (term frequency matrix)

       term1     term2    term3  term4 .......
1945     3         5        7       8 .....
1985     1         2        0       7  .....
2011      .            .   .    

Any help would be greatly appreciated.
koder
  • Old question was closed as it wasn't clear. Apologies, I'm new here. Added additional data. – koder May 21 '13 at 18:46
  • For future reference, closed isn't meant to mean "dead". Closed questions are intended as a "time-out" until the question can be improved via editing. (If improvement is not possible, it will generally be deleted, after some period of time.) The "closed" terminology is somewhat misleading in that respect (and is currently undergoing some revision). – joran May 21 '13 at 18:47
  • Regardless, your question is not quite up to the standards we shoot for around here. For instance, Googling "term frequency matrix in r" quickly leads me to the **tm** package. You should investigate some tools like that first, make some attempts, and _then_ ask for help when some specific piece of code isn't working. – joran May 21 '13 at 18:50
  • Thanks for the response. I'm aware of the tm package and I tried it for a long time before posting here. But I couldn't obtain the desired output through the tm package: it basically takes in a corpus of text and creates a TDM of terms against the documents they appear in. Here, my requirement is different. Please correct me if I'm wrong and suggest a solution. Thanks in advance. – koder May 21 '13 at 19:00

1 Answer


In the future please provide a minimal working example.

This doesn't use tm exactly, but rather qdap, as it fits your data type better:

library(qdap)
# create a fake data set (please do this yourself in the future)
# DATA is a small demo data set that ships with qdap
dat <- data.frame(year=1945:(1945+10), summary=DATA$state)

##    year                               summary
## 1  1945         Computer is fun. Not too fun.
## 2  1946               No it's not, it's dumb.
## 3  1947                    What should we do?
## 4  1948                  You liar, it stinks!
## 5  1949               I am telling the truth!
## 6  1950                How can we be certain?
## 7  1951                      There is no way.
## 8  1952                       I distrust you.
## 9  1953           What are you talking about?
## 10 1954         Shall we move on?  Good then.
## 11 1955 I'm hungry.  Let's eat.  You already?

Now to create the word frequency matrix (similar to a term document matrix):

t(with(dat, wfm(summary, year)))

##      about already am are be ... you
## 1945     0       0  0   0  0       0
## 1946     0       0  0   0  0       0
## 1947     0       0  0   0  0       0
## 1948     0       0  0   0  0       1
## 1949     0       0  1   0  0       0
## 1950     0       0  0   0  1       0
## 1951     0       0  0   0  0       0
## 1952     0       0  0   0  0       1
## 1953     1       0  0   1  0       1
## 1954     0       0  0   0  0       0
## 1955     0       1  0   0  0       1

Alternatively, you can create a true DocumentTermMatrix as of qdap version 1.1.0:

with(dat, dtm(summary, year))

## > with(dat, dtm(summary, year))
## A document-term matrix (11 documents, 41 terms)
## 
## Non-/sparse entries: 51/400
## Sparsity           : 89%
## Maximal term length: 8 
## Weighting          : term frequency (tf)
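Because the result is a standard tm DocumentTermMatrix, the usual tm and base R tools apply to it. A rough sketch (untested on your data; dtm_years is just an illustrative name) of pulling frequent terms and clustering the years by their term profiles:

library(tm)
dtm_years <- with(dat, dtm(summary, year))   # years as documents, terms as columns
findFreqTerms(dtm_years, lowfreq = 2)        # terms that occur at least twice

m <- as.matrix(dtm_years)
hc <- hclust(dist(m))                        # hierarchical clustering of the years
plot(hc)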
Tyler Rinker
  • What a handy function! Your `wfm` is much easier than using `DataframeSource` from the `tm` package. – Ben May 21 '13 at 19:28
  • qdap is easier for this type of data, but for corpus data (which tm is designed for) tm is much easier. – Tyler Rinker May 21 '13 at 19:50
  • Beautiful! Thanks a lot Tyler. Will try this code. My csv file has 98000 rows. Would you suggest the same code for the whole corpus? Also, how to investigate how terms associated with ‘fraud’ have changed over time or vary by agency? Can we do clustering? – koder May 21 '13 at 20:15
  • These are questions I don't know the answer to. You may want to use correspondence analysis, but this is outside of my expertise. Also, the use of `t` at the end is unnecessary and may be costly in time on that many rows for no reason. – Tyler Rinker May 21 '13 at 20:29
  • I need to find word associations or the top 10/15 words in each year, so I'd need to sort the matrix. How can we remove the stop words and do stemming on this corpus, without making it a text pool as in tm? – koder May 22 '13 at 05:22
  • You need to do some work on your own. Start with `?wfm` and half your question would have been answered. Have a look at the [documentation for qdap](https://dl.dropbox.com/u/61803503/qdap.pdf). I know there's a lot, but the pdf is searchable. – Tyler Rinker May 22 '13 at 06:12
  • Thanks Tyler. I've split the csv files into individual text files by reading each row and naming them with years using Python, imported them into R and created a TDM, which seemed a better way to go, given my level of knowledge on R. :) Now, I'm stuck at a later stage of process regarding which I posted a new question here : http://stackoverflow.com/questions/16695866/r-finding-the-top-10-terms-associated-with-the-term-fraud-across-documents-i – koder May 22 '13 at 15:35
  • qdap can do what you want pretty easily, but it's difficult to spend time helping you when you haven't even read the documentation for the function. If you look at `?wfm` you'd see there's an easy way to deal with stopwords (see the sketch after this thread). Search the pdf for stem (or look through the functions list). If you're willing to learn I'm willing to help, but I'm not willing to write code for people. – Tyler Rinker May 23 '13 at 00:02
  • Amazing stuff in qdap, Tyler! Thanks a ton. – koder Mar 21 '14 at 09:00
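On the stopword point above, a minimal sketch, assuming wfm exposes a stopwords argument as described in `?wfm` (verify against your installed qdap version):

library(qdap)
library(tm)
# drop common English stop words before the matrix is built
# (the stopwords argument is assumed here from ?wfm; check your qdap version)
t(with(dat, wfm(summary, year, stopwords = tm::stopwords("english"))))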