Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then that paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers, so I want to know how I should start. I have tried tf-idf but could not get meaningful results. Does someone know of a library to do this task easily? Kindly suggest one. I shall be thankful.
-
3 million titles of research papers and people ask on Stack Overflow. That's the end of academia as we know it :-) – Leo Mar 20 '14 at 19:48
-
You may find a community with smarter people to answer this question than this one :-) http://stats.stackexchange.com – Leo Mar 20 '14 at 19:50
-
Do you precisely know the number of categories in advance? (for example Medicine, Mechanics, IT, Aerospace...) Or are you going to build them automagically? – HAL9000 Mar 20 '14 at 19:57
-
Are the categories disjoint sets, or is a paper allowed to be in two or more categories? – HAL9000 Mar 20 '14 at 20:02
-
No, I don't know the categories in advance; the only thing I know is that all the papers are related to IT – user3313379 Mar 21 '14 at 03:32
-
Relevant blog post: http://aqibsaeed.github.io/2016-07-26-text-classification/ and Python notebook: https://github.com/aqibsaeed/Research-Paper-Categorization – philshem Apr 15 '19 at 12:39
3 Answers
If you don't know the categories in advance, then it's not classification, but clustering. Basically, you need to do the following:
- Select algorithm.
- Select and extract features.
- Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that best fits your case.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't predict the number of clusters even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see discussion here).
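To make that trade-off concrete, here is a minimal sketch contrasting the two families, assuming scikit-learn is installed; the random matrix `X` is just a stand-in for real title features (built as in the next snippet):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.random.rand(100, 20)  # stand-in for vectorized titles

# K-means: you must guess the number of clusters up front.
km_labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed, but eps/min_samples must be tuned;
# points it cannot assign to any cluster get the noise label -1.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```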
As for features, the words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what all that means).
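A sketch of that tokenize/normalize/vectorize step, assuming NLTK and scikit-learn; the two-title `titles` list is hypothetical stand-in data, and with NLTK you may first need `nltk.download('punkt')`:

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def normalize(title):
    # Lowercase, tokenize, and stem every word of a title.
    return " ".join(stemmer.stem(tok) for tok in word_tokenize(title.lower()))

titles = ["A Survey of Computer Network Protocols",
          "Deep Learning for Image Recognition"]  # stand-in data

docs = [normalize(t) for t in titles]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
# X is a sparse document-term matrix you can feed to KMeans/DBSCAN as above.
```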
Some useful links:
- Clustering text documents using k-means
- NLTK clustering package
- Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you prefer another language, you will most probably be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you can try a simple word frequency model (bag of words) and later move on to more complex feature extraction methods (string kernels). You can start by using SVMs (Support Vector Machines) to classify the data via LibSVM (a widely used SVM package).
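A minimal bag-of-words plus SVM sketch, assuming scikit-learn (whose `SVC` class is built on LIBSVM); the `titles` and `labels` here are hypothetical stand-ins, and you would need a hand-labeled set of titles to actually train on:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in labeled data; real training data would be hand-labeled titles.
titles = ["A Survey of Computer Network Protocols",
          "Deep Learning for Image Recognition"]
labels = ["computer network", "computer vision"]

# Bag-of-words features feeding a linear-kernel SVM.
clf = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
clf.fit(titles, labels)
print(clf.predict(["Routing in wireless computer networks"]))
```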

Given the fact that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.
