Questions tagged [document-classification]

Document classification is the task of assigning each document in a given set to one or more classes, where those classes are known a priori.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems overlap, however, so there is also interdisciplinary research on document classification.

227 questions
25 votes, 3 answers

scikit-learn TfidfVectorizer meaning?

I was reading about scikit-learn's TfidfVectorizer implementation, and I don't understand what the output of the method is. For example: new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play…
25 votes, 3 answers

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weight for each document vector. Then I can use this matrix to train a model for document classification. I am looking forward to classify…
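
One common way to handle this, sketched here under stated assumptions rather than taken from the question or its answers: compute the IDF weights once on the training collection, store them, and reuse them to weight any new document. The function names (`tokenize`, `fit_idf`, `tfidf_vector`), the whitespace tokenizer, the smoothed IDF variant and the sample documents are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(corpus):
    """Compute an IDF weight for every term seen in the training corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    # Smoothed IDF (an assumption; other variants exist).
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    """Weight a document with stored IDF values; unseen terms are dropped."""
    counts = Counter(tokenize(doc))
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


# Hypothetical training collection and new document.
train_docs = ["the cat sat", "the dog barked", "the cat purred"]
idf = fit_idf(train_docs)
new_vec = tfidf_vector("the cat meowed", idf)
```

The new document never changes the stored IDF values, so its vector lives in the same feature space the classifier was trained on. Terms that did not occur in training ("meowed" here) are simply ignored.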
25 votes, 5 answers

What tried and true algorithms for suggesting related articles are out there?

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related. Let's assume very little metadata…
kch • 77,385 • 46 • 136 • 148
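
As a minimal sketch of one widely used baseline (not necessarily what the answers to this question recommend), related articles can be ranked by the cosine similarity of their term-count vectors. The helper names (`term_counts`, `cosine`, `related`) and the sample articles are hypothetical, and the whitespace tokenizer is an assumption.

```python
import math
from collections import Counter


def term_counts(text):
    # Hypothetical bag-of-words representation: lowercase, whitespace split.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def related(target, articles, top_n=2):
    """Return the titles of the top_n articles most similar to `target`."""
    target_vec = term_counts(articles[target])
    scores = [
        (cosine(target_vec, term_counts(text)), title)
        for title, text in articles.items()
        if title != target
    ]
    return [title for _, title in sorted(scores, reverse=True)[:top_n]]


# Hypothetical articles keyed by title.
articles = {
    "python-tips": "python code tips for clean python code",
    "python-testing": "testing python code with unit tests",
    "gardening": "how to grow tomatoes in a small garden",
}
suggestions = related("python-tips", articles, top_n=1)
```

Raw counts favor frequent words, so in practice this kind of baseline is often combined with stop-word removal or TF*IDF weighting.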
15 votes, 7 answers

Text classification/categorization algorithm

My objective is to [semi]automatically assign texts to different categories. There's a set of user defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then…
Max • 19,654 • 13 • 84 • 122
14 votes, 3 answers

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of documents that have already been classified by humans into some groups. Is there a modified version of LDA which I can use to train a model and then later classify unknown documents with it?
snøreven • 1,904 • 2 • 19 • 39
13 votes, 4 answers

Scalable or online out-of-core multi-label classifiers

I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels. I have around 4.5 million text documents as training data and around 1…
11 votes, 8 answers

Understanding Bayes' Theorem

I'm working on an implementation of a Naive Bayes Classifier. Programming Collective Intelligence introduces this subject by describing Bayes' Theorem as: Pr(A | B) = Pr(B | A) x Pr(A)/Pr(B), as well as a specific example relevant to document…
benmcredmond • 1,702 • 2 • 15 • 22
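
The formula quoted in the excerpt, Pr(A | B) = Pr(B | A) x Pr(A)/Pr(B), can be worked through numerically. The sketch below is an illustration only: the categories ("spam" and "ham"), the probabilities and the `posterior` helper are all hypothetical and are not the example from the book.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: A = "document is spam", B = "document contains the word".
p_spam = 0.4               # Pr(A): prior probability of the spam category
p_word_given_spam = 0.5    # Pr(B | A): word frequency within spam documents
p_word_given_ham = 0.1     # Pr(B | not A): word frequency within other documents

# Pr(B) via the law of total probability over the two categories.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Pr(A | B): probability that a document containing the word is spam.
p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
```

With these assumed values Pr(B) is 0.26 and the posterior is 0.2 / 0.26, roughly 0.77, so seeing the word raises the spam probability from the 0.4 prior.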
8 votes, 2 answers

Create_Analytics in RTextTools

I am trying to classify text documents into a number of categories. My code below works fine: matrix[[i]] <- create_matrix(trainingdata[[i]][,1], language="english",removeNumbers=FALSE,stemWords=FALSE,weighting=weightTf,minWordLength=3) …
7 votes, 4 answers

Dictionary words for download

Can someone offer a suggestion on where to find a dictionary word list with frequency information? Ideally, the source would be English words of the North American variety.
AlgoMan • 2,785 • 6 • 34 • 40
7 votes, 1 answer

NLTK - Multi-labeled Classification

I am using NLTK to classify documents that have 1 label each, with 10 types of documents. For text extraction, I am cleaning text (punctuation removal, html tag removal, lowercasing), removing nltk.corpus.stopwords, as well as my own…
redrubia • 2,256 • 6 • 33 • 47
7 votes, 3 answers

Multi-Label Document Classification

I have a database in which I store data based upon the following three fields: id, text, {labels}. Note that each text has been assigned to more than one label \ tag \ class. I want to build a model (weka \ rapidminer \ mahout) that will be able to…
7 votes, 1 answer

How do you initialize a gensim corpus variable with a csr_matrix?

I have X as a csr_matrix that I obtained using scikit's tfidf vectorizer, and y, which is an array. My plan is to create features using LDA; however, I failed to find how to initialize a gensim corpus variable with X as a csr_matrix. In other words,…
IssamLaradji • 6,637 • 8 • 43 • 68
7 votes, 4 answers

text categorization classifiers

Does anybody know of good open-source text-categorization models? I know about Stanford Classifier, Weka, Mallet, etc. but all of them require training. I need to classify news articles into Sports/Politics/Health/Gaming/etc. Is there any…
7 votes, 3 answers

Which classification algorithm can be used for document categorization?

Hey, here is my problem: given a set of documents, I need to assign each document to a predefined category. I was going to use the n-gram approach to represent the text-content of each document and then train an SVM classifier on the training data…
6 votes, 3 answers

Basic text classification with Weka in Java

I'm trying to build a text classifier in Java with Weka. I have read some tutorials, and I'm trying to build my own classifier. I have the following categories: computer, sport, unknown and the following already trained data cs belongs to…
joxxe • 261 • 1 • 3 • 13