15

My objective is to [semi]automatically assign texts to different categories. There's a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an algorithm, and perhaps a .NET library that implements it?

Max

7 Answers

19

Doing this is not trivial. Obviously you can build a dictionary that maps certain keywords to categories. Just finding a keyword would suggest a certain category.

Yet in natural language text, the keywords usually do not appear in their stem form. You would need some morphology tools to find the stem form and look that up in the dictionary.
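As a rough illustration of the keyword dictionary plus stemming idea, here is a minimal Python sketch. The `KEYWORDS` table, the category names, and the `crude_stem` suffix stripper are all hypothetical stand-ins invented for this example; a real morphology tool would do far better than chopping off a few English suffixes.

```python
from collections import Counter

# Hypothetical keyword-to-category dictionary (stems only).
KEYWORDS = {
    "guitar": "music",
    "concert": "music",
    "wrench": "tools",
    "bolt": "tools",
}

# Toy suffix list; an assumption standing in for a real stemmer.
SUFFIXES = ("ing", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def suggest_categories(text):
    """Count how many keyword hits each category gets in the text."""
    counts = Counter()
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        category = KEYWORDS.get(crude_stem(word))
        if category is not None:
            counts[category] += 1
    return counts


if __name__ == "__main__":
    hits = suggest_categories("Two guitars and three concerts, plus one wrench.")
    print(hits.most_common())  # categories ordered by number of keyword hits
```

Note that this only counts keyword hits. It knows nothing about negation or context, so a sentence like "This article is not about guitars" would still count towards the music category.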

But then somebody could write something like: "This article is not about ...". This would introduce the need for syntactic and semantic analysis.

And then you would find that certain keywords can be used in several categories: "band" could belong to music, technology, or even handicraft. You would therefore need an ontology, plus statistical or other methods, to weigh the probability of each candidate category when the keyword alone is not decisive.

Some of the keywords might not even be easy to fit into an ontology: is a mathematician closer to a programmer or to a gardener? But you said in your question that the categories are defined by humans, so they could also help build the ontology.

Have a look at computational linguistics here and in Wikipedia for further study.

Now, the more narrow the field your texts are from, the more structured they are, and the smaller the vocabulary, the easier the problem becomes.

Again, some keywords for further study: morphology, syntax analysis, semantics, ontology, computational linguistics, indexing, keywording.

Community
Ralph M. Rickenbach
7

There are multiple approaches to automatic text classification. A naive Bayes classifier is possibly the simplest of them. Another one you can use is k-nearest neighbors. This Google answer on categorization of text might help you.
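To make the naive Bayes suggestion concrete, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python, with add-one (Laplace) smoothing. The class name, the whitespace tokenizer, and the tiny training set are assumptions made up for illustration, not taken from any particular library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in tokenize(text):
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


if __name__ == "__main__":
    clf = NaiveBayesTextClassifier()
    clf.train("the band played a concert", "music")
    clf.train("new album from the guitar band", "music")
    clf.train("rubber band and a wrench", "hardware")
    clf.train("tighten the bolt with a wrench", "hardware")
    print(clf.classify("the guitar concert was loud"))
    print(clf.classify("a wrench and a bolt"))
```

Because every category gets a score, this handles any number of user-defined categories, and new human-labelled texts can be added at any time by calling `train` again.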

Gangadhar
  • An upvote for the link you provided. The answer was rigorously researched and the kind of information summarized there was astounding. Thanks! I wish I could give 10 votes to an answer. – Fr0zenFyr Jan 09 '15 at 08:06
  • The link is good and probably reasonably stable, but Stack Overflow answers should be self-contained. Could you at least briefly summarize the resource you are linking to? – tripleee Jan 08 '16 at 10:14
5

Watch my video series on exactly this topic.

http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html

Classification is in video 5, but the other videos may help you get up to speed.

It's all based on the FOSS program RapidMiner.

Neil McGuigan
3

Check out this example from scikit-learn. It applies a whole bunch of different algorithms to the same data, so you can compare the results.

rptwsthi
Diego
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Aᴍɪʀ Jan 23 '16 at 06:54
  • In this case the answer is really to use that particular framework and I have named it in my answer. Which algorithm will fit better depends on the data. – Diego Jan 23 '16 at 09:41
2

Support vector machine. Everyone loves support vector machines. You'll need to do quite a bit of reading, and perhaps even buy a book. But you could start by reading a paper to see if you like the idea.
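To give a feel for what training a support vector machine involves, here is a minimal sketch of a linear SVM for two categories, trained with a Pegasos-style subgradient step on the hinge loss. Several things here are assumptions for the sake of a short example: the bag-of-words features, the `lam` and `epochs` values, the deterministic pass over the examples (Pegasos proper samples them at random), and the made-up training texts with labels +1 for music and -1 for tools.

```python
from collections import defaultdict


def features(text):
    """Bag-of-words counts as a sparse dict."""
    counts = defaultdict(float)
    for word in text.lower().split():
        counts[word] += 1.0
    return counts


def dot(weights, feats):
    return sum(weights.get(word, 0.0) * value for word, value in feats.items())


def train_linear_svm(examples, lam=0.01, epochs=20):
    """examples: list of (text, label) pairs, label is +1 or -1."""
    weights = {}
    step = 0
    for _ in range(epochs):
        for text, label in examples:
            step += 1
            eta = 1.0 / (lam * step)       # decaying learning rate
            feats = features(text)
            margin = label * dot(weights, feats)
            shrink = 1.0 - 1.0 / step      # equals 1 - eta * lam (L2 penalty)
            for word in weights:
                weights[word] *= shrink
            if margin < 1.0:               # hinge loss is active: push towards label
                for word, value in feats.items():
                    weights[word] = weights.get(word, 0.0) + eta * label * value
    return weights


def classify(weights, text):
    return 1 if dot(weights, features(text)) >= 0 else -1


if __name__ == "__main__":
    training = [
        ("guitar concert album", +1),
        ("wrench bolt hammer", -1),
        ("concert tour guitar", +1),
        ("hammer drill wrench", -1),
    ]
    w = train_linear_svm(training)
    print(classify(w, "guitar concert"))
    print(classify(w, "wrench hammer"))
```

This sketch only separates two categories. With more categories you would typically train one such classifier per category (one-vs-rest) and pick the highest score.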

Tom Anderson
  • A friend who knows a lot more about this than me says "An SVM would indeed be a smart choice Tom. There are more efficient techniques that will give you similar results if you have large datasets though... how many training samples per category?", to which I replied "Not sure, I ask for someone else. But not a lot, I think.", to which he in turn replied "OK well the simple answer is that an SVM would be a good place to start.". So now you know. – Tom Anderson Aug 30 '10 at 21:46
  • It is harder to do multi-class classification with an SVM. Much easier with naive Bayes or k-NN. – Neil McGuigan Dec 11 '10 at 21:18
1

I've been looking for the answer to this question for quite a while. Today I found my answer.

There is an open-source program called "dbacl" that does this. It classifies documents into as many categories as you like (up to a certain maximum).

The other answers saying things like "not trivial" are all true, but having an easy-to-use package that does the hard stuff goes a long way toward making it manageable.

rew
  • While this is a useful off-the-shelf utility, the question, and this site, are about programming problems, not finding useful utilities. Thus, this answer should perhaps be a comment instead. – tripleee Jan 08 '16 at 10:06
  • Agreed, if "programming" is the topic, a standard utility is off-topic. On the other hand, an open source program allows you to investigate it and extract the algorithms used. I've taken the original question as: "I have this problem I want to solve, and I'm willing to program it myself if necessary". In that light, a standard utility will help the original asker as well as people who end up here with a similar problem. – rew Jan 25 '16 at 14:10
1

The general term for these methods is "multivariate methods". Searching for that term together with "text classification" or "text categorization" should bring up some useful leads. Good luck!

Grembo