Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then that paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers, so I want to know how I should start. I have tried tf-idf but could not get meaningful results. Does someone know of a library to do this task easily? Kindly suggest one. I shall be thankful.
-
3 million titles of research papers and people ask on Stack Overflow. That's the end of academia as we know it :-) – Leo Mar 20 '14 at 19:48
-
You may find a community with smarter people to answer this question than this one :-) http://stats.stackexchange.com – Leo Mar 20 '14 at 19:50
-
Do you precisely know the number of categories in advance? (for example Medicine, Mechanics, IT, Aerospace...) Or are you going to build them automagically? – HAL9000 Mar 20 '14 at 19:57
-
Are the categories disjoint sets, or is a paper allowed to be in two or more categories? – HAL9000 Mar 20 '14 at 20:02
-
No, I don't know the categories in advance; the only thing I know is that all the papers are related to IT – user3313379 Mar 21 '14 at 03:32
-
Relevant blog post: http://aqibsaeed.github.io/2016-07-26-text-classification/ and Python notebook: https://github.com/aqibsaeed/Research-Paper-Categorization – philshem Apr 15 '19 at 12:39
3 Answers
If you don't know the categories in advance, then it's not classification, but clustering. Basically, you need to do the following:
- Select algorithm.
- Select and extract features.
- Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that best fits your case.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't predict the number of clusters even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see discussion here).
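To make that trade-off concrete, here is a minimal sketch contrasting the two families, assuming scikit-learn is installed; the random matrix `X` is just a stand-in for real title features (built as in the next snippet):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.random.rand(100, 20)  # stand-in for vectorized titles

# K-means: you must guess the number of clusters up front.
km_labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed, but eps/min_samples must be tuned;
# points it cannot assign to any cluster get the noise label -1.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```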
As for features, the words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what all that means).
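A sketch of that tokenize/normalize/vectorize step, assuming NLTK and scikit-learn; the two-title `titles` list is hypothetical stand-in data, and with NLTK you may first need `nltk.download('punkt')`:

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def normalize(title):
    # Lowercase, tokenize, and stem every word of a title.
    return " ".join(stemmer.stem(tok) for tok in word_tokenize(title.lower()))

titles = ["A Survey of Computer Network Protocols",
          "Deep Learning for Image Recognition"]  # stand-in data

docs = [normalize(t) for t in titles]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
# X is a sparse document-term matrix you can feed to KMeans/DBSCAN as above.
```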
Some useful links:
- Clustering text documents using k-means
- NLTK clustering package
- Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you prefer another language, you will most probably be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you can try a simple word frequency model (bag of words) and later move on to more complex feature extraction methods (string kernels). You can start by using SVMs (Support Vector Machines) to classify the data via LibSVM (a widely used SVM package).
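A minimal bag-of-words plus SVM sketch, assuming scikit-learn (whose `SVC` class is built on LIBSVM); the `titles` and `labels` here are hypothetical stand-ins, and you would need a hand-labeled set of titles to actually train on:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in labeled data; real training data would be hand-labeled titles.
titles = ["A Survey of Computer Network Protocols",
          "Deep Learning for Image Recognition"]
labels = ["computer network", "computer vision"]

# Bag-of-words features feeding a linear-kernel SVM.
clf = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
clf.fit(titles, labels)
print(clf.predict(["Routing in wireless computer networks"]))
```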

Given the fact that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.
