
I am trying to find the best method to classify a single page, based on training a classifier (or classifiers) on a number of unique documents (let's say 350). Each of these documents can be 1 to n pages long, and there may be 10 to 1000 samples available to train the classifier on for each document. The problem: given a random page from one of these documents (the testing set is different from the training set), I want to classify that page as one of those documents. I had previously tried building a single classifier and got good results, but the test set then consisted of whole documents; now it consists of pages from those documents.

I have already formed a partial solution, which I describe below, though I would appreciate suggestions to improve and complete it.

  1. Getting a list of all files using scandir.
  2. Creating a list of objects, one per file, with the attributes path, sample_num, doc_type, json_data, and preprocessed_data.
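
     A rough sketch of steps 1 and 2. The directory path, the file-naming convention used to derive doc_type, and the Sample class name are my assumptions, not fixed parts of the pipeline:

    import os
    import json

    class Sample:
        def __init__(self, path, sample_num, doc_type, json_data):
            self.path = path
            self.sample_num = sample_num
            self.doc_type = doc_type
            self.json_data = json_data
            self.preprocessed_data = None  # filled in later, during step 3

    samples = []
    for i, entry in enumerate(os.scandir('samples')):        # 'samples' is a placeholder directory
        if entry.is_file() and entry.name.endswith('.json'):
            with open(entry.path) as f:
                data = json.load(f)
            doc_type = entry.name.split('_')[0]              # assumed naming convention: <doctype>_<num>.json
            samples.append(Sample(entry.path, i, doc_type, data))
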
  3. Preprocessing all the JSON data by removing punctuation, numeric values and stopwords, then stemming, and returning a list of words along with their frequencies.
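
     A minimal sketch of the preprocessing in step 3, assuming NLTK supplies the stopword list and stemmer (and that its corpora are already downloaded); the JSON data is treated here as a plain string of text:

    import string
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP_WORDS = set(stopwords.words('english'))
    STEMMER = PorterStemmer()

    def preprocess(text):
        # strip punctuation, lower-case, and tokenise on whitespace
        text = text.translate(str.maketrans('', '', string.punctuation)).lower()
        # drop numeric tokens and stopwords, then stem
        tokens = [t for t in text.split() if t.isalpha() and t not in STOP_WORDS]
        stems = [STEMMER.stem(t) for t in tokens]
        # return each word along with its frequency
        return Counter(stems)
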
  4. Creating a feature list for each unique document; this list is primarily the 'n' most frequent words that occur in that document.
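
     One possible way to build the per-document feature list in step 4, assuming each sample has already been preprocessed into the word-frequency Counter from step 3 (function and variable names are illustrative):

    from collections import Counter

    def build_feature_list(preprocessed_samples, n=100):
        # preprocessed_samples: list of Counters for all samples of one unique document
        combined = Counter()
        for counts in preprocessed_samples:
            combined.update(counts)
        # the feature list is the n most frequent (stemmed) words of that document
        return [word for word, _ in combined.most_common(n)]
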
  5. Splitting the data 70:30 into training and testing sets; each unique document has 70% of its samples in the training set and 30% in the testing set. (Splitting a list of file names in a predefined ratio)

    #all_doc_types is a dictionary with unique documents and their total counts as key-value pairs
    def split_data(sampleObjList):
        trainingData = []
        testData = []
        temp_dict = {}
    
        for x in sampleObjList:
            ratio = int(0.7*all_doc_types[x.doc_type])+1
            if x.doc_type not in temp_dict:
                temp_dict[x.doc_type] = 1
                trainingData.append(x)
            else:
                temp_dict[x.doc_type] += 1
                if(temp_dict[x.doc_type] < ratio):
                    trainingData.append(x)
                else:
                    testData.append(x)
    
        return trainingData, testData
    
  6. Formatting the training data for the fit function: forming a list of 1s and 0s corresponding to each document in the directory. This list is built by matching the words of a particular document against its corresponding feature list; if a word from the feature list occurs in the document, the entry is 1, otherwise 0.
  7. Similarly formatting the testing data for the predict function.
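
     A sketch of the formatting in steps 6 and 7: each page or document becomes a binary vector over a feature list (this assumes the Counter output of step 3 and the feature list of step 4; the function name is illustrative):

    def to_feature_vector(word_counts, feature_list):
        # 1 if the feature word occurs in the page/document, 0 otherwise
        return [1 if word in word_counts else 0 for word in feature_list]
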
  8. Training a separate model for each unique document.

    import pickle
    from sklearn.naive_bayes import GaussianNB

    # trainingData is assumed here to be a dict mapping doc_type -> list of feature vectors
    for docType in trainingData:
        classifier = GaussianNB()   # a fresh model for each document type
        classifier.fit(trainingData[docType], [docType]*len(trainingData[docType]))
        filename = 'NB_classifier_'+docType+'.pickle'
        with open(filename, 'wb') as f:
            pickle.dump(classifier, f)
    
  9. Testing on the test data, which consists of single pages chosen randomly from the given unique documents.

I can easily run the test set through each of the models I have trained, but I would have no way of knowing which document to classify a test page as, because more than one model returns a positive result. Hence I am looking either to combine those Naive Bayes models into one instead of keeping them as separate binary classifiers, OR to find a way to pick the best result from the set of outputs I get from the multiple binary classifiers.
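
As a concrete illustration of the second option, the sketch below loads each per-document model and keeps the one that reports the highest probability for its own document. This is only meaningful if each model is trained as a genuine binary (one-vs-rest) classifier, i.e. it also sees negative pages from the other documents; with the single-class training shown in step 8, every model would report probability 1. The file-naming pattern follows step 8, and the helper name is mine.

    import pickle

    def classify_page(page_vector, doc_types):
        scores = {}
        for doc_type in doc_types:
            with open('NB_classifier_' + doc_type + '.pickle', 'rb') as f:
                model = pickle.load(f)
            # probability the model assigns to its own document type
            proba = model.predict_proba([page_vector])[0]
            scores[doc_type] = proba[list(model.classes_).index(doc_type)]
        # pick the document whose model is most confident about this page
        return max(scores, key=scores.get)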

I searched before asking and these are the closest results -

I could not figure out how to apply either of these to my specific problem.

Any help would be greatly appreciated.

NOTE: I had a bad experience earlier where people downvoted my question without giving a reason; that question happened to be Splitting a list of file names in a predefined ratio. It would be really useful if I could be given feedback for any such downvote; after all, we are all learning every day.

  • Well, the naive approach is extracting probabilities from your classifiers and building a geometric mean to combine those. But you'd be able to get this all for free if you just use one classifier and multi-class classification (which I recommend). The other thing is the classifier itself: I would think that a (strongly regularized) nonlinear-kernel SVM should be better (but that's a feeling). The more complex approaches to combining different classifiers are called [Stacking](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking). But often these are not single-class at the core like your approach. – sascha Sep 06 '16 at 23:24
  • Thank you for the answer @sascha. I tried the one-classifier approach (a multi-class classifier trained on all documents) and got poor results, though I admit I had less training and test data at that time. I do not want to move on to SVM before exhausting all possibilities with the Naive Bayes approach. Do you have any examples of Stacking with NB? I only found examples with other classifiers (mostly multi-class, as you pointed out). Also, could you shed more light on what you mean by building a geometric mean to combine those probabilities? – Shivansh Singh Sep 07 '16 at 15:19
