
I have a collection of Wikipedia dumps. I need to classify them into a list of categories that I have, such as Sports, Law, Music, Movies, etc. There are around 300 categories. From each article I extracted the description and the article's category list.

What I observed is that the nouns in the descriptions give a pretty good idea of the entity, and the first sentence of the description is the most important. For example, in "Cristiano Ronaldo is a Portuguese professional footballer", 'footballer' is a noun. Likewise, in the list of categories on Ronaldo's Wikipedia page, the words 'footballer' and 'football' are repeated multiple times.

Considering that, I cleaned all the descriptions using Natural Language Processing and extracted only the nouns from the data. Overall, there are around 20,000 distinct words (nouns) in the complete corpus of the data that I have. As mentioned before, there are around 300 classes.

What I've done so far is pretty basic. I extract the nouns from the data about an entity and process them with NLP techniques like stemming. Then I use GloVe to get vectors for the most common nouns in the description and for the categories, and I pick the category with the smallest cosine distance to the most common word. For example, if the most common word in the data about an entity is 'novel', the cosine distance between the word 'novel' and the category 'Book' is small, so I output that the entity is a 'Book'. But this gives an accuracy of around 60%, which is not good.
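To make that concrete, here is roughly what the baseline looks like (a minimal sketch; `glove` is assumed to be a dict mapping each word to its pre-trained vector, loaded elsewhere):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two word vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(most_common_noun, categories, glove):
    """Return the category whose GloVe vector is closest to the noun's."""
    word_vec = glove[most_common_noun]
    scores = {c: cosine_similarity(word_vec, glove[c])
              for c in categories if c in glove}
    return max(scores, key=scores.get)

# e.g. classify('novel', ['book', 'movie', 'music'], glove) -> 'book'
```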

Hence, I would like to use deep learning with TensorFlow, or some other library, to do the classification for me.

My input vector would be of the form [0,1,0,6,0,...,10,0,...] with a width of 20,000 (the number of distinct words), where each index holds the frequency of the corresponding word in the description; this is important, as frequency is of prime importance for me. The output should be of the form [0.12, 0.2, 0.01, 0.00, ..., 0.7, 0.14, ...] with a width of 300, where some classes have high values and some have low values, depending on the entity's description.
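Building that input is straightforward with the Keras tokenizer; a sketch (here `noun_texts` is assumed to be my list of space-joined noun strings, one per entity):

```python
from keras.preprocessing.text import Tokenizer

NUM_WORDS = 20000  # keep the 20,000 most frequent nouns

tokenizer = Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(noun_texts)

# mode='count' keeps raw frequencies, which matter for me
X = tokenizer.texts_to_matrix(noun_texts, mode='count')  # shape: (n_entities, 20000)
```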

I also have training data of sufficient size: around a million entities with their descriptions and correct labels, which I guess should be enough. I am new to deep learning and would appreciate a raw structure of the code which I can play with and learn from at the same time. I have some basic understanding of TensorFlow and Keras, but it's difficult for me to proceed.

I am also aware that I can use embeddings pre-trained on a Wikipedia corpus, such as GloVe and Word2Vec. Any help would be great.
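For reference, this is roughly how I load the GloVe vectors into the `glove` dict used above (assuming the plain-text format of the official GloVe releases; the file name is just the standard 100-dimensional download):

```python
import numpy as np

glove = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype='float32')
```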


Example: From this Wikipedia page, I have extracted the following information:

Categories: Novels by Sue Grafton, Kinsey Millhone novels, 2005 American novels, 1953 in fiction, 1987 in fiction, Novels set in California, G. P. Putnam's Sons books, 2000s mystery novel stubs

Description: It is the 19th novel in Sue Grafton's Alphabet series of mystery novels and features Kinsey Millhone, a private eye based in Santa Teresa, California.

Based on this, it should be classified as 'book' or 'novel'.


EDIT: Because I'm new to TensorFlow, what I want is code that I can understand and adapt to my needs. A basic structure of similar code from any resource will do.

1 Answer

Stack Overflow is not a place to ask for ready-made code.

I can, however, point you in a general direction, which should be enough to get you to some tutorials covering similar (enough) topics.

Your problem can certainly be approached in several different ways. Since your bag-of-words (which is what your "counting vector" is called) drops any ordering of the words, and since your inputs are effectively of fixed length, it can be processed with standard techniques like an MLP (Multi-Layer Perceptron).

A Word2Vec model only makes sense if you have plain one-hot vectors (indicating which words appear at all, not their frequency). Yet, a simple embedding layer might work just as well; that generally just means that you embed your high-dimensional space (in your case ~20,000 dimensions) into a lower one. There is no general rule of thumb as to how many dimensions will best approximate your problem without being too sparse, but I would say something around 100-150 dimensions should work well.
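A minimal sketch of that embedding-layer variant in Keras (all layer sizes are illustrative, not tuned; the input here is the padded sequence of word indices rather than the count vector, so repeated words still contribute more after the averaging step):

```python
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

model = Sequential([
    # embed the ~20,000-word vocabulary into 128 dimensions;
    # inputs are sequences of word indices, padded/truncated to length 200
    Embedding(input_dim=20000, output_dim=128, input_length=200),
    GlobalAveragePooling1D(),          # average the word embeddings
    Dense(300, activation='softmax'),  # one probability per category
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```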

In an MLP, you basically just stack non-linear layers one after another. You can follow a basic MNIST example for multiclass classification (note that it is based on an older version of TensorFlow, but should still work the same); they even have two different approaches, both of which you can apply to your problem.
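To give you a skeleton to play with, here is a bare-bones MLP in Keras operating directly on your 20,000-dimensional count vectors (layer sizes are illustrative; `X_train`/`y_train` are assumed to be your count matrix and one-hot labels of width 300):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation='relu', input_shape=(20000,)),
    Dropout(0.5),                      # guards against overfitting
    Dense(256, activation='relu'),
    Dense(300, activation='softmax'),  # one probability per category
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=10, validation_split=0.1)
```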

That should get you on the right path, and you can see whether your problem gets a somewhat decent solution or remains hard to tackle. Also consider how your classes are distributed; with 300 categories, it might well be that certain classes are underrepresented and you run into issues during training.

Did you do a basic statistical analysis on the distribution yet?
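If not, even a simple frequency count is a good start (`labels` is assumed to be your list of ground-truth category names, one per entity):

```python
from collections import Counter

counts = Counter(labels)
print(counts.most_common(10))       # the most frequent classes
print(counts.most_common()[-10:])   # the rarest classes
```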

Edit: Incorporated a comment from below. To add to the MNIST example: even though you have different data origins, the data can be modeled the same way. You have high-dimensional non-binary inputs, and you want only a few output labels that represent some form of statistical likelihood of belonging to a certain class.
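In code, mapping the softmax output back to a category is just an argmax (a sketch; `X_test` and `category_names` are assumed, the latter being a list aligned with the 300 output indices):

```python
import numpy as np

probs = model.predict(X_test)          # shape: (n_samples, 300), rows sum to 1
predicted = np.argmax(probs, axis=1)   # index of the most likely class
predicted_names = [category_names[i] for i in predicted]
```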

dennlinger
  • Note that your input and output are the same as in the MNIST example, in the sense that you have a high-dimensional input of non-binary values, and a lower-dimensional output. The MNIST example performs an additional operation called _softmax_ in the end (which makes the result easier to handle internally, as well as more representative in a statistical sense), and then simply assigns the category as the maximum value in the result (which is still of whatever length you specify it to be). – dennlinger Jun 15 '18 at 07:04
  • You can always edit & update your post - no need to put such lengthy clarifications in the comments... – desertnaut Jun 15 '18 at 07:16
  • I don't think it is fully relevant to my answer, but a mere suggestion to his updates. Kind of like a PS note. Will incorporate anyways. – dennlinger Jun 15 '18 at 07:17