I have a collection of Wikipedia dumps that I need to classify into a list of categories I already have, such as Sports, Law, Music, Movie, etc. There are around 300 categories. I have extracted the descriptions from the articles, as well as each article's category list.
What I observed is that the nouns in the descriptions give a pretty good idea of the entity, and the first sentence of the description is the most important. For example, in "Cristiano Ronaldo is a Portuguese professional footballer", the key noun is 'footballer'. Also, in the list of categories on Ronaldo's Wikipedia page, the words 'footballer' and 'football' are repeated multiple times.
Considering that, I cleaned all the descriptions using NLP and extracted only the nouns from the data. Overall there are around 20,000 distinct nouns in the complete corpus and, as mentioned before, around 300 classes.
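Roughly, my extraction step looks like this (a simplified sketch; NLTK and the Porter stemmer here are just stand-ins for whatever tagger/stemmer combination is used):

```python
# Simplified sketch of the noun-extraction step. NLTK is an assumption here;
# it needs nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') once.
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def extract_nouns(description):
    tagged = nltk.pos_tag(nltk.word_tokenize(description))
    # Keep only tokens tagged as nouns (NN, NNS, NNP, NNPS).
    nouns = [word.lower() for word, tag in tagged if tag.startswith('NN')]
    return [stemmer.stem(noun) for noun in nouns]   # e.g. 'novels' -> 'novel'
```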
What I've done till now is pretty basic. I extract the nouns from the data about an entity and process them with NLP techniques like stemming. Then I use GloVe to get vectors for the most common nouns in the description and for the categories, and I pick the category whose vector has the smallest cosine distance to the most common word. For example, if the most common word in the data about an entity is 'novel', then the cosine distance between 'novel' and the category 'Book' is small, so I output that the entity is a 'Book'. But this only gives an accuracy of around 60%, which is not good.
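My baseline is essentially this (sketch; `glove` is assumed to be a dict mapping word to NumPy vector, e.g. loaded from a glove.6B.300d.txt file):

```python
# Current nearest-cosine-distance baseline (simplified).
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify_baseline(most_common_noun, categories, glove):
    word_vec = glove[most_common_noun]
    # Pick the category whose GloVe vector is closest to the noun's vector.
    return min(categories, key=lambda c: cosine_distance(word_vec, glove[c]))
```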
Hence, I would like to use deep learning with TensorFlow, or some other library, to do the classification for me.
My input vector would be of the form [0, 1, 0, 6, 0, ..., 10, 0, ...] with a width of 20,000 (the number of distinct words), where each index holds the frequency of the corresponding word in the description; this is important, as frequency is of prime importance for me. The output should be of the form [0.12, 0.2, 0.01, 0.00, ..., 0.7, 0.14, ...] with a width of 300, where some classes have high values and some low, depending on the entity's description.
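The vectorization step I have in mind is roughly this (sketch; `vocab_index` is an assumed dict mapping each of my 20,000 stemmed nouns to an index):

```python
# Build a raw-frequency vector of width 20,000 from the extracted nouns.
import numpy as np

def vectorize(nouns, vocab_index, vocab_size=20000):
    x = np.zeros(vocab_size, dtype=np.float32)
    for noun in nouns:
        if noun in vocab_index:
            x[vocab_index[noun]] += 1.0   # raw counts, since frequency matters to me
    return x
```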
I have training data of sufficient size as well: around a million entities with their descriptions and correct labels, which I guess should be enough. I am new to deep learning and would appreciate a raw structure of the code that I can play with and learn from at the same time. I have some basic understanding of TensorFlow and Keras, but it's difficult for me to proceed.
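Something like this is the kind of skeleton I'm imagining (a minimal Keras MLP over the frequency vectors; the layer sizes and hyperparameters are arbitrary placeholders, not something I've validated):

```python
# Minimal Keras skeleton: 20,000-dim frequency input -> 300-way softmax.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(20000,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(300, activation='softmax'),  # one probability per class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# X: (num_entities, 20000) frequency matrix, y: integer labels in [0, 300)
# model.fit(X, y, batch_size=256, epochs=5, validation_split=0.1)
```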
I am also aware that I can use pre-trained embeddings trained on a Wikipedia corpus, such as GloVe and Word2Vec. Any help would be great.
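If I went the embedding route in Keras, my understanding is it would look roughly like this (the embedding matrix below is a random stand-in for real GloVe rows, and the inputs would be padded word-ID sequences rather than my count vectors):

```python
# Sketch of plugging pre-trained vectors into Keras via a frozen Embedding layer.
import numpy as np
from tensorflow import keras

vocab_size, embed_dim = 20000, 300                        # placeholder sizes
embedding_matrix = np.random.rand(vocab_size, embed_dim)  # stand-in: fill with GloVe rows

model = keras.Sequential([
    keras.layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False),                    # keep the pre-trained vectors frozen
    keras.layers.GlobalAveragePooling1D(),   # average the word vectors per description
    keras.layers.Dense(300, activation='softmax'),
])
```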
Example: From this Wikipedia page, I have extracted the following information:
Categories: Novels by Sue Grafton, Kinsey Millhone novels, 2005 American novels, 1953 in fiction, 1987 in fiction, Novels set in California, G. P. Putnam's Sons books, 2000s mystery novel stubs
Description: It is the 19th novel in Sue Grafton's Alphabet series of mystery novels and features Kinsey Millhone, a private eye based in Santa Teresa, California.
Based on this, the entity should be classified as 'book' or 'novel'.
EDIT: Because I'm new to TensorFlow, what I want is code that I can understand and change according to my needs. Some basic structure of similar code from any resource will do.