Questions tagged [document-classification]

Document classification is the task of assigning each document in a given set to one or more classes, where those classes are known a priori.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The two problems overlap, however, so there is also interdisciplinary research on document classification.

227 questions

3 answers

scikit-learn TfidfVectorizer meaning?

I was reading about the TfidfVectorizer implementation in scikit-learn, and I don't understand what the method outputs. For example: new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play…

asked Sep 17 '14 at 23:50

anon

3 answers

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weights for each document vector. I can then use this matrix to train a model for document classification. I am looking to classify…

machine-learning classification information-retrieval text-mining document-classification

asked Apr 01 '14 at 15:59

smwikipedia
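
One common way to approach the question above is to compute the IDF weights once on the training collection and reuse them, unchanged, when weighting a new document. The following is only a minimal sketch under stated assumptions: the whitespace tokenizer, the smoothed IDF formula `log((1 + N) / (1 + df)) + 1`, the helper names `fit_idf` and `tfidf_vector`, and the toy documents are all illustrative choices, not something taken from the question or its answers.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real pipeline would do more.
    return text.lower().split()


def fit_idf(train_docs):
    """Compute one IDF weight per term from the training collection only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokenize(doc)))
    # Assumption: a smoothed IDF variant, log((1 + N) / (1 + df)) + 1.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    """Weight a document with the stored IDF; unseen terms are dropped."""
    term_freq = Counter(tokenize(doc))
    return {
        term: count * idf[term]
        for term, count in term_freq.items()
        if term in idf
    }


# Hypothetical toy data.
train_docs = ["the cat sat", "the dog barked", "the cat purred"]
idf = fit_idf(train_docs)

# The new document is weighted with the training IDF, not a recomputed one.
new_vec = tfidf_vector("the cat meowed", idf)
print(new_vec)
```

In this sketch, terms that never appeared in the training collection (here "meowed") are simply ignored, because the trained model has no feature for them.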

5 answers

What tried and true algorithms for suggesting related articles are out there?

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related. Let's assume very little metadata…

text machine-learning information-retrieval document-classification

asked Aug 10 '09 at 12:38

kch

7 answers

Text classification/categorization algorithm

My objective is to [semi]automatically assign texts to different categories. There's a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then…

algorithm text-mining document-classification

asked Aug 27 '10 at 13:12

Max

3 answers

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of documents that have already been classified by humans into groups. Is there a modified version of LDA that I can use to train a model and then later classify unknown documents with it?

machine-learning nlp classification document-classification lda

asked Nov 25 '12 at 20:12

snøreven

4 answers

Scalable or online out-of-core multi-label classifiers

I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels. I have around 4.5 million text documents as training data and around 1…

machine-learning classification scikit-learn document-classification text-classification

asked Sep 08 '13 at 14:43

Gaurav Kumar

8 answers

Understanding Bayes' Theorem

I'm working on an implementation of a Naive Bayes classifier. Programming Collective Intelligence introduces this subject by describing Bayes' theorem as Pr(A | B) = Pr(B | A) x Pr(A) / Pr(B), as well as a specific example relevant to document…

statistics bayesian naivebayes document-classification

asked Dec 29 '09 at 11:59

benmcredmond
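
The formula quoted in the question above, Pr(A | B) = Pr(B | A) x Pr(A) / Pr(B), can be checked numerically. The sketch below is a hedged illustration, not the book's example: the function name `posterior` and all probabilities are made-up values, with A standing for "document is spam" and B for "document contains a given word".

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)."""
    if p_b <= 0:
        raise ValueError("Pr(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical numbers, chosen only for illustration.
p_spam = 0.4             # Pr(A): prior probability that a document is spam
p_word_given_spam = 0.5  # Pr(B | A): the word appears in a spam document
p_word_given_ham = 0.1   # Pr(B | not A): the word appears in a non-spam document

# Pr(B) via the law of total probability over spam and non-spam.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Pr(A | B): probability that a document containing the word is spam.
p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
print(round(p_spam_given_word, 3))
```

A Naive Bayes classifier applies the same rule per class, combining the per-word likelihoods under the assumption that words are conditionally independent given the class.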

2 answers

Create_Analytics in RTextTools

I am trying to classify text documents into a number of categories. The code below works fine: matrix[[i]] <- create_matrix(trainingdata[[i]][,1], language="english",removeNumbers=FALSE,stemWords=FALSE,weighting=weightTf,minWordLength=3) …

r precision text-mining document-classification confusion-matrix

asked May 09 '14 at 09:40

Prasanna Nandakumar

4 answers

Dictionary words for download

Can someone offer a suggestion on where to find a dictionary word list with frequency information? Ideally, the source would be English words of the North American variety.

nlp document-classification

asked Nov 20 '10 at 18:46

AlgoMan

1 answer

NLTK - Multi-labeled Classification

I am using NLTK to classify documents, each having one label, with 10 types of documents in total. For text extraction, I am cleaning the text (punctuation removal, HTML tag removal, lowercasing), removing nltk.corpus.stopwords, as well as my own…

python nlp nltk document-classification

asked May 09 '14 at 18:39

redrubia

3 answers

Multi-Label Document Classification

I have a database in which I store data based on the following three fields: id, text, {labels}. Note that each text has been assigned more than one label / tag / class. I want to build a model (Weka / RapidMiner / Mahout) that will be able to…

java machine-learning text-mining document-classification

asked May 21 '13 at 15:06

user2295350

1 answer

How do you initialize a gensim corpus variable with a csr_matrix?

I have X as a csr_matrix that I obtained using scikit-learn's TF-IDF vectorizer, and y, which is an array. My plan is to create features using LDA; however, I could not find out how to initialize a gensim corpus variable with X as a csr_matrix. In other words,…

python scikit-learn document-classification lda gensim

asked Mar 27 '13 at 22:12

IssamLaradji

4 answers

text categorization classifiers

Does anybody know of good open-source text-categorization models? I know about Stanford Classifier, Weka, Mallet, etc. but all of them require training. I need to classify news articles into Sports/Politics/Health/Gaming/etc. Is there any…

java machine-learning classification document-classification categorization

asked Mar 07 '13 at 15:16

MFARID

3 answers

Which classification algorithm can be used for document categorization?

Here is my problem: given a set of documents, I need to assign each document to a predefined category. I was going to use the n-gram approach to represent the text content of each document and then train an SVM classifier on the training data…

algorithm machine-learning classification document-classification

asked Aug 20 '12 at 01:54

TeFa

3 answers

Basic text classification with Weka in Java

I'm trying to build a text classifier in Java with Weka. I have read some tutorials, and I'm trying to build my own classifier. I have the following categories: computer, sport, unknown and the following already trained data cs belongs to…

java classification weka document-classification

asked Mar 14 '12 at 18:22

joxxe
