I am trying to find the best method to classify a single page, based on training a classifier (or classifiers) on a number of unique documents (let's say 350). Each of these documents can be 1 to n pages long, and there may be 10 to 1000 samples available per document to train on. The problem: given a random page from one of these documents (the testing set is disjoint from the training set), I want to classify that page as one of those documents. I had previously built a single classifier and got good results, but the test set then consisted of whole documents, whereas now it consists of single pages from those documents.
I have already formed a partial solution, which I describe below; I would appreciate suggestions to improve and complete it.
- Getting a list of all files using scandir.
- Creating an object for each file, with attributes path, sample_num, doc_type, json_data and preprocessed_data.
- Preprocessing all the json data by removing punctuation, numeric values and stopwords, stemming the remaining words, and returning a list of words along with their frequencies.
- Creating a feature list for each unique document; this list is primarily the n most frequent words that occur in that document (a rough sketch of these preprocessing and feature steps follows this list).
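A rough sketch of these two steps, assuming NLTK's stopword list and Porter stemmer and a Counter for the frequencies (the helper names are just for illustration):

import string
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(text):
    # strip punctuation, drop numeric tokens and stopwords, stem what is left
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = [w.lower() for w in text.split() if not w.isdigit()]
    words = [STEMMER.stem(w) for w in words if w not in STOPWORDS]
    return Counter(words)                      # word -> frequency

def feature_list(doc_word_counts, n):
    # the n most frequent words across all samples of one unique document
    combined = Counter()
    for counts in doc_word_counts:
        combined.update(counts)
    return [word for word, _ in combined.most_common(n)]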
Splitting the data 70:30 into training and testing sets, so that 70% of each document's samples go to the training set and 30% to the testing set (Splitting a list of file names in a predefined ratio):
# all_doc_types is a dictionary with unique documents and their total counts as key-value pairs
def split_data(sampleObjList):
    trainingData = []
    testData = []
    temp_dict = {}
    for x in sampleObjList:
        ratio = int(0.7 * all_doc_types[x.doc_type]) + 1
        if x.doc_type not in temp_dict:
            temp_dict[x.doc_type] = 1
            trainingData.append(x)
        else:
            temp_dict[x.doc_type] += 1
            if temp_dict[x.doc_type] < ratio:
                trainingData.append(x)
            else:
                testData.append(x)
    return trainingData, testData
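For comparison, scikit-learn's train_test_split could do an equivalent stratified split (a sketch, not part of my current code):

from sklearn.model_selection import train_test_split

# stratifying on doc_type keeps roughly a 70:30 split within every document type
trainingData, testData = train_test_split(
    sampleObjList,
    test_size=0.3,
    stratify=[x.doc_type for x in sampleObjList],
    random_state=42,
)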
- Formatting the training data for the fit function: forming a list of 1's and 0's corresponding to each document in the directory. This list is made by matching the words of a particular document against its corresponding feature list; if a word in the feature list occurs in the document the entry is 1, otherwise 0 (a sketch of this step follows below).
- Similarly formatting the testing data for the predict function.
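A sketch of that formatting step (make_feature_vector and feature_lists are names used only for illustration; preprocessed_data is the word-frequency mapping from the preprocessing step):

def make_feature_vector(page_word_counts, feature_words):
    # 1 if the feature word occurs in the sample, 0 otherwise
    return [1 if word in page_word_counts else 0 for word in feature_words]

# e.g. vectors for every training sample of one document type
# (feature_lists maps doc_type -> its list of feature words)
doc_type = 'invoice'                            # hypothetical document name
X_train = [make_feature_vector(s.preprocessed_data, feature_lists[doc_type])
           for s in trainingData if s.doc_type == doc_type]
y_train = [doc_type] * len(X_train)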
Training a unique model corresponding to each document:
import pickle
from sklearn.naive_bayes import GaussianNB

# trainingData here maps each docType to its list of feature vectors (after the formatting step above)
for docType in trainingData:
    classifier = GaussianNB()   # a fresh model per document
    classifier.fit(trainingData[docType], [docType] * len(trainingData[docType]))
    filename = 'NB_classifier_' + docType + '.pickle'
    with open(filename, 'wb') as f:
        pickle.dump(classifier, f)
Testing on the test data, which consists of single pages chosen randomly from the given unique documents.
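Concretely, running one test page through one of the saved models looks roughly like this (test_page and the 'invoice' document name are placeholders):

import pickle

with open('NB_classifier_invoice.pickle', 'rb') as f:   # one of the per-document models saved above
    clf = pickle.load(f)

page_vector = make_feature_vector(test_page.preprocessed_data,
                                  feature_lists['invoice'])
print(clf.predict([page_vector]))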
I can easily run the test set through each of the models I have trained, but I would have no way of knowing which document to classify a test page as, because more than one model would return a positive result. Hence I am looking either to combine those Naive Bayes models into one classifier instead of separate binary classifiers, or to find a way to pick the best result from the set of outputs I get from the multiple binary classifiers.
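To make the first option concrete, what I have in mind is something like a single multiclass GaussianNB trained on page-level samples over one shared feature vocabulary, instead of one binary model per document (a sketch under that assumption, with hypothetical names):

from sklearn.naive_bayes import GaussianNB

# one shared vocabulary: the union of all per-document feature lists
shared_features = sorted({w for words in feature_lists.values() for w in words})

X_train = [make_feature_vector(s.preprocessed_data, shared_features) for s in trainingData]
y_train = [s.doc_type for s in trainingData]

multiclass_clf = GaussianNB().fit(X_train, y_train)

# a single predict call then returns one document label per test page
X_test = [make_feature_vector(s.preprocessed_data, shared_features) for s in testData]
predicted_doc_types = multiclass_clf.predict(X_test)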
I searched before asking and these are the closest results -
I could not figure out how to apply either of these to my specific problem.
Any help would be greatly appreciated.
NOTE: I had a bad experience earlier where people downvoted my question without giving a reason; as it happens, that question was Splitting a list of file names in a predefined ratio. It would be really useful to be given feedback for any such downvote; after all, we are all learning every day.