Questions tagged [term-document-matrix]

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms; a term-document matrix is its transpose. There are various schemes for determining the value that each entry in the matrix should take; one such scheme is tf-idf. These matrices are widely used in natural language processing.

When creating a database of terms that appear in a set of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. For instance, if one has the following two (short) documents:

D1 = "I like databases"

D2 = "I hate hate databases",

then the document-term matrix would be:

      I   like   hate   databases
D1    1   1      0      1
D2    1   0      2      1

which shows which documents contain which terms and how many times they appear. Note that more sophisticated weights can be used; one typical example, among others, would be tf-idf.
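The count matrix for D1 and D2 can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `document_term_matrix` and its `vocabulary` parameter are made up for illustration, and the whitespace tokenizer is an assumption (real pipelines usually normalize case and punctuation first).

```python
from collections import Counter


def document_term_matrix(docs, vocabulary=None):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    # Assumption: naive whitespace tokenization, no lowercasing or stemming.
    tokenized = [doc.split() for doc in docs]
    if vocabulary is None:
        # Default column order: terms in order of first appearance.
        vocabulary = list(dict.fromkeys(tok for toks in tokenized for tok in toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = ["I like databases", "I hate hate databases"]
# Fix the column order so it matches the example: I, like, hate, databases.
terms, dtm = document_term_matrix(docs, vocabulary=["I", "like", "hate", "databases"])
for name, row in zip(["D1", "D2"], dtm):
    print(name, row)
```

Transposing `dtm` gives the term-document orientation, with one row per term and one column per document.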

Source: http://en.wikipedia.org/wiki/Document-term_matrix
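As a rough illustration of tf-idf weighting on the same two documents, the sketch below multiplies each raw count by idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. This is only one of several tf-idf variants, chosen here as an assumption; libraries often apply smoothing or normalization and will produce different numbers. The function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs, vocabulary):
    """Weight raw term counts by idf = log(N / df) (one variant among several)."""
    # Assumption: naive whitespace tokenization.
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = {term: sum(term in toks for toks in tokenized) for term in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            counts[term] * math.log(n_docs / df[term]) if df[term] else 0.0
            for term in vocabulary
        ])
    return rows


docs = ["I like databases", "I hate hate databases"]
vocabulary = ["I", "like", "hate", "databases"]
weights = tfidf_matrix(docs, vocabulary)
for name, row in zip(["D1", "D2"], weights):
    print(name, [round(w, 3) for w in row])
```

Under this variant, terms that occur in every document ("I" and "databases") get weight zero, while "hate" in D2 gets the largest weight because it is both frequent in D2 and absent from D1.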

152 questions

votes

4 answers

Error converting text to lowercase with tm_map(..., tolower)

I tried using the tm_map. It gave the following error. How can I get around this? require(tm) byword<-tm_map(byword, tolower) Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"

r tm lowercase term-document-matrix

asked Nov 30 '12 at 06:35

jackStinger


votes

6 answers

list of word frequencies using R

I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same library(tm) library(RWeka) txt <- read.csv("HW.csv",header=T) df <- do.call("rbind",…

r text-mining word-frequency term-document-matrix

asked Aug 07 '13 at 10:30

ProcRJ

votes

3 answers

efficient Term Document Matrix with NLTK

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function: def fnDTM_Corpus(xCorpus): import pandas as pd '''to create a Term Document Matrix from a NLTK Corpus''' fd_list = [] for x in…

python pandas nltk term-document-matrix

asked Apr 09 '13 at 10:46

user1043144


votes

3 answers

How can I tell Solr to return the hit search terms per document?

I have a question about queries in Solr. When I perform a query with multiple search terms that are all logically linked by OR (e.g. q=content:(foo OR bar OR foobar)), then Solr returns a list of documents that each match any of these terms. But…

solr term-document-matrix

asked Jul 30 '14 at 13:27

tbmsu

votes

4 answers

More efficient means of creating a corpus and DTM with 4M rows

My file has over 4M rows and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier. Consider the following code: library(tm) GetCorpus <-function(textVector) { …

r data.table corpus term-document-matrix qdap

asked Aug 15 '14 at 16:57

user1477388


votes

3 answers

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create…

r text-mining tm corpus term-document-matrix

asked Aug 28 '14 at 14:36

Brian P


votes

3 answers

TermDocumentMatrix sometimes throwing error

I am creating a Word Cloud based on Tweets from various different sports teams. This code executes successfully about 1 in 10 times: handle <- 'arsenal' txt <- searchTwitter(handle,n=1000,lang='en') t <- sapply(txt,function(x) x$getText()) t <-…

r word-cloud term-document-matrix

asked Sep 06 '14 at 10:31

Dan

votes

1 answer

R tm package create matrix of N most frequent terms

I have a termDocumentMatrix created using the tm package in R. I'm trying to create a matrix/dataframe that has the 50 most frequently occurring terms. When I try to convert to a matrix I get this error: > ap.m <- as.matrix(mydata.dtm) Error: cannot…

r text-mining tm term-document-matrix

asked Jul 16 '12 at 16:42

screechOwl


votes

2 answers

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Following the many guides to creating biGrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-Grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus'…

r tm n-gram term-document-matrix rweka

asked Mar 13 '17 at 05:33

Paul_J

votes

1 answer

Big Text Corpus breaks tm_map

I have been breaking my head over this one over the last few days. I searched all the SO archives and tried the suggested solutions but just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99 etc, and…

r text-mining tm text-analysis term-document-matrix

asked Nov 09 '14 at 23:30

Kartik

votes

1 answer

How to efficiently compute similarity between documents in a stream of documents

I gather Text documents (in Node.js) where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of…

node.js stream nlp cosine-similarity term-document-matrix

asked Dec 21 '12 at 08:17

Alexandre Kaspar

votes

1 answer

Creating a sparse matrix from a TermDocumentMatrix

I've created a TermDocumentMatrix from the tm library in R. It looks something like this: > inspect(freq.terms) A document-term matrix (19 documents, 214 terms) Non-/sparse entries: 256/3810 Sparsity : 94% Maximal term length: 19…

r sparse-matrix tm term-document-matrix

asked Feb 10 '14 at 21:13

user1988898

votes

2 answers

How to build a Term-Document-Matrix from a set of texts and a specific set of terms (tags)?

I have two sets of data: a set of tags (single words like php, html, etc.) and a set of texts. I now wish to build a Term-Document-Matrix representing the number of occurrences of each tag in each text. I have looked into R library tm, and…

r term-document-matrix

asked Oct 31 '13 at 11:56

Timothée HENRY


votes

1 answer

how to calculate term-document matrix?

I know that Term-Document Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. I am…

python scikit-learn scipy term-document-matrix

asked Apr 01 '17 at 07:23

log0


votes

1 answer

Creating a term-document matrix in Python from ElasticSearch index

ElasticSearch newbie here. I have a set of text documents which I've indexed using ElasticSearch through the Python ElasticSearch client. Now I want to do some machine learning with the documents using Python and scikit-learn. I need to accomplish…

python elasticsearch machine-learning term-document-matrix

asked Jun 02 '15 at 06:05

plam

