Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. Such matrices are useful in the field of natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

        I      like   hate   databases
D1      1      1      0      1
D2      1      0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
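
The example above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function names `document_term_matrix` and `tf_idf` are hypothetical, tokenization is a naive case-sensitive whitespace split, the column order is fixed by hand to match the table, and the weighting shown (raw count times the natural log of N/df) is just one of several common tf-idf variants, not one the source specifies.

```python
import math
from collections import Counter


def document_term_matrix(documents, terms):
    """Return one row of raw counts per document; columns follow `terms`."""
    rows = []
    for text in documents:
        # Naive whitespace tokenization, case-sensitive (an assumption).
        counts = Counter(text.split())
        rows.append([counts[term] for term in terms])
    return rows


def tf_idf(matrix):
    """Reweight raw counts as count * ln(N / df), one common tf-idf variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # df[j] = number of documents in which term j occurs at least once
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    # Terms that occur in no document get weight 0 to avoid dividing by zero.
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[count * w for count, w in zip(row, idf)] for row in matrix]


documents = ["I like databases", "I hate hate databases"]
terms = ["I", "like", "hate", "databases"]

counts = document_term_matrix(documents, terms)
weighted = tf_idf(counts)

for name, row in zip(["D1", "D2"], counts):
    print(name, row)
# D1 [1, 1, 0, 1]
# D2 [1, 0, 2, 1]
```

Under this particular variant, "I" and "databases" occur in both documents and therefore receive weight 0, while "like" and "hate" keep a positive weight proportional to their counts.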

Source: http://en.wikipedia.org/wiki/Document-term_matrix

152 questions
48 votes, 4 answers

Error converting text to lowercase with tm_map(..., tolower)

I tried using the tm_map. It gave the following error. How can I get around this? require(tm) byword<-tm_map(byword, tolower) Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"
jackStinger
21 votes, 6 answers

list of word frequencies using R

I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same library(tm) library(RWeka) txt <- read.csv("HW.csv",header=T) df <- do.call("rbind",…
ProcRJ
17 votes, 3 answers

efficient Term Document Matrix with NLTK

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function: def fnDTM_Corpus(xCorpus): import pandas as pd '''to create a Term Document Matrix from a NLTK Corpus''' fd_list = [] for x in…
user1043144
16 votes, 3 answers

How can I tell Solr to return the hit search terms per document?

I have a question about queries in Solr. When I perform a query with multiple search terms that are all logically linked by OR (e.g. q=content:(foo OR bar OR foobar)), then Solr returns a list of documents that each match any of these terms. But…
tbmsu
13 votes, 4 answers

More efficient means of creating a corpus and DTM with 4M rows

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document term matrix such that I can pass it to a bayesian classifier. Consider the following code: library(tm) GetCorpus <-function(textVector) { …
user1477388
12 votes, 3 answers

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create…
Brian P
7 votes, 3 answers

TermDocumentMatrix sometimes throwing error

I am creating a Word Cloud based on Tweets from various different sports teams. This code executes successfully about 1 in 10 times: handle <- 'arsenal' txt <- searchTwitter(handle,n=1000,lang='en') t <- sapply(txt,function(x) x$getText()) t <-…
Dan
7 votes, 1 answer

R tm package create matrix of Nmost frequent terms

I have a termDocumentMatrix created using the tm package in R. I'm trying to create a matrix/dataframe that has the 50 most frequently occurring terms. When I try to convert to a matrix I get this error: > ap.m <- as.matrix(mydata.dtm) Error: cannot…
screechOwl
6 votes, 2 answers

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Following the many guides to creating biGrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-Grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus'…
Paul_J
6 votes, 1 answer

Big Text Corpus breaks tm_map

I have been breaking my head over this one over the last few days. I searched all the SO archives and tried the suggested solutions but just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99 etc, and…
Kartik
6 votes, 1 answer

How to efficiently compute similarity between documents in a stream of documents

I gather Text documents (in Node.js) where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of…
5 votes, 1 answer

Creating a sparse matrix from a TermDocumentMatrix

I've created a TermDocumentMatrix from the tm library in R. It looks something like this: > inspect(freq.terms) A document-term matrix (19 documents, 214 terms) Non-/sparse entries: 256/3810 Sparsity : 94% Maximal term length: 19…
user1988898
5 votes, 2 answers

How to build a Term-Document-Matrix from a set of texts and a specific set of terms (tags)?

I have two sets of data: a set of tags (single words like php, html, etc.) and a set of texts. I now wish to build a Term-Document-Matrix representing the number of occurrences of the tag elements in the text elements. I have looked into R library tm, and…
Timothée HENRY
4 votes, 1 answer

how to calculate term-document matrix?

I know that Term-Document Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. I am…
log0
4 votes, 1 answer

Creating a term-document matrix in Python from ElasticSearch index

ElasticSearch newbie here. I have a set of text documents which I've indexed using ElasticSearch through the Python ElasticSearch client. Now I want to do some machine learning with the documents using Python and scikit-learn. I need to accomplish…
plam