I am trying to find the best method to classify a single page, based on training a classifier (or classifiers) on a number of unique documents (let's say 350). Each of these documents can be 1 to n pages long, and there may be 10 to 1000 samples available per document to train on. The problem: given a random page from one of these documents (the testing set is disjoint from the training set), I want to classify that page as one of those documents. I had previously built a single classifier and got good results, but the test set then consisted of whole documents, whereas now it consists of single pages from those documents.
I have already formed a partial solution, which I describe below; I would appreciate suggestions to improve and complete it.
- Getting a list of all files using scandir.
- Creating an object for each file, with attributes path, sample_num, doc_type, json_data and preprocessed_data.
- Preprocessing all the json data by removing punctuation, numeric values and stopwords, stemming the remaining words, and returning a list of words along with their frequencies.
- Creating a feature list for each unique document; this list is primarily the n most frequent words that occur in that document (a rough sketch of these preprocessing and feature steps follows this list).
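A rough sketch of these two steps, assuming NLTK's stopword list and Porter stemmer and a Counter for the frequencies (the helper names are just for illustration):

import string
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(text):
    # strip punctuation, drop numeric tokens and stopwords, stem what is left
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [w.lower() for w in text.split() if not w.isdigit()]
    words = [STEMMER.stem(w) for w in words if w not in STOPWORDS]
    return Counter(words)                      # word -> frequency

def feature_list(doc_word_counts, n):
    # the n most frequent words across all samples of one unique document
    combined = Counter()
    for counts in doc_word_counts:
        combined.update(counts)
    return [word for word, _ in combined.most_common(n)]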
Splitting the data 70:30 into training and testing sets, so that 70% of each document's samples go to the training set and 30% to the testing set (Splitting a list of file names in a predefined ratio):
# all_doc_types is a dictionary with unique documents and their total counts as key-value pairs
def split_data(sampleObjList):
    trainingData = []
    testData = []
    temp_dict = {}
    for x in sampleObjList:
        ratio = int(0.7 * all_doc_types[x.doc_type]) + 1
        if x.doc_type not in temp_dict:
            temp_dict[x.doc_type] = 1
            trainingData.append(x)
        else:
            temp_dict[x.doc_type] += 1
            if temp_dict[x.doc_type] < ratio:
                trainingData.append(x)
            else:
                testData.append(x)
    return trainingData, testData
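For comparison, scikit-learn's train_test_split could do an equivalent stratified split (a sketch, not part of my current code):

from sklearn.model_selection import train_test_split

# stratifying on doc_type keeps roughly a 70:30 split within every document type
trainingData, testData = train_test_split(
    sampleObjList,
    test_size=0.3,
    stratify=[x.doc_type for x in sampleObjList],
    random_state=42,
)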
- Formatting the training data for the fit function: forming a list of 1's and 0's corresponding to each document in the directory. This list is made by matching the words of a particular document against its corresponding feature list; if a word in the feature list occurs in the document the entry is 1, otherwise 0 (a sketch of this step follows below).
- Similarly formatting the testing data for the predict function.
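A sketch of that formatting step (make_feature_vector and feature_lists are names used only for illustration; preprocessed_data is the word-frequency mapping from the preprocessing step):

def make_feature_vector(page_word_counts, feature_words):
    # 1 if the feature word occurs in the sample, 0 otherwise
    return [1 if word in page_word_counts else 0 for word in feature_words]

# e.g. vectors for every training sample of one document type
# (feature_lists maps doc_type -> its list of feature words)
doc_type = 'invoice'                            # hypothetical document name
X_train = [make_feature_vector(s.preprocessed_data, feature_lists[doc_type])
           for s in trainingData if s.doc_type == doc_type]
y_train = [doc_type] * len(X_train)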
Training a unique model corresponding to each document:
import pickle
from sklearn.naive_bayes import GaussianNB

# trainingData here maps each docType to its list of feature vectors (after the formatting step above)
for docType in trainingData:
    classifier = GaussianNB()   # a fresh model per document
    classifier.fit(trainingData[docType], [docType] * len(trainingData[docType]))
    filename = 'NB_classifier_' + docType + '.pickle'
    with open(filename, 'wb') as f:
        pickle.dump(classifier, f)
Testing on the test data, which consists of single pages chosen randomly from the given unique documents.
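Concretely, running one test page through one of the saved models looks roughly like this (test_page and the 'invoice' document name are placeholders):

import pickle

with open('NB_classifier_invoice.pickle', 'rb') as f:   # one of the per-document models saved above
    clf = pickle.load(f)

page_vector = make_feature_vector(test_page.preprocessed_data,
                                  feature_lists['invoice'])
print(clf.predict([page_vector]))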
I can easily run the test set through each of the models I have trained, but I would have no way of knowing which document to classify a test page as, because more than one model would return a positive result. Hence I am looking either to combine those Naive Bayes models into one classifier instead of separate binary classifiers, or to find a way to pick the best result from the set of outputs I get from the multiple binary classifiers.
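To make the first option concrete, what I have in mind is something like a single multiclass GaussianNB trained on page-level samples over one shared feature vocabulary, instead of one binary model per document (a sketch under that assumption, with hypothetical names):

from sklearn.naive_bayes import GaussianNB

# one shared vocabulary: the union of all per-document feature lists
shared_features = sorted({w for words in feature_lists.values() for w in words})

X_train = [make_feature_vector(s.preprocessed_data, shared_features) for s in trainingData]
y_train = [s.doc_type for s in trainingData]

multiclass_clf = GaussianNB().fit(X_train, y_train)

# a single predict call then returns one document label per test page
X_test = [make_feature_vector(s.preprocessed_data, shared_features) for s in testData]
predicted_doc_types = multiclass_clf.predict(X_test)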
I searched before asking and these are the closest results -
I could not figure out how to apply either of these to my specific problem.
Any help would be greatly appreciated.
NOTE: I had a bad experience earlier where people downvoted my question without giving a reason; as it happens, that question was Splitting a list of file names in a predefined ratio. It would be really useful to be given feedback for any such downvote; after all, we are all learning every day.