I am trying to train a model on a set of texts by a single author and then use it to identify that author's texts in a test set. I am using scikit-learn's OneClassSVM model. 'author_corpus' is a list of texts written by a single author, and 'test_corpus' is a list of texts written by both the original author and other authors. I want OneClassSVM to pick out the original author's texts in the test set.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    def analyse_corpus(author_corpus, test_corpus):
        vectorizer = TfidfVectorizer()
        author_vectors = vectorizer.fit_transform(author_corpus)
        test_vectors = vectorizer.fit_transform(test_corpus)
        model = OneClassSVM(gamma='auto')
        model.fit(author_vectors)
        test = model.predict(test_vectors)
Running this raises the following ValueError:
X.shape[1] = 2484 should be equal to 1478, the number of features at training time
How might I implement this model to accurately predict authorship of texts in the test set given the single author in the train set? Any help is appreciated.
For reference, here is the OneClassSVM documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM
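
A likely cause of the shape mismatch is that calling fit_transform on the test corpus makes the vectorizer learn a second, different vocabulary (2484 terms instead of the 1478 seen during training). The sketch below assumes the fix is simply to fit the vectorizer once on the author's texts and call transform on the test texts, so both matrices share the same columns. The toy corpora are hypothetical and only there to make the example runnable; whether the predictions are accurate on real data is a separate question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM


def analyse_corpus(author_corpus, test_corpus):
    # Learn the vocabulary and IDF weights from the author's texts only.
    vectorizer = TfidfVectorizer()
    author_vectors = vectorizer.fit_transform(author_corpus)

    # Reuse the fitted vocabulary: transform, not fit_transform.
    # Words that never appeared in author_corpus are ignored.
    test_vectors = vectorizer.transform(test_corpus)

    model = OneClassSVM(gamma='auto')
    model.fit(author_vectors)

    # predict labels each test text +1 (inlier) or -1 (outlier).
    return model.predict(test_vectors)


# Hypothetical toy data, for illustration only.
author_corpus = [
    "the sea was calm and the ship sailed on",
    "the ship sailed through the calm night",
    "a calm sea carried the ship home",
]
test_corpus = [
    "the ship sailed on a calm sea",
    "quarterly revenue exceeded analyst forecasts",
]

predictions = analyse_corpus(author_corpus, test_corpus)
print(predictions)
```

With this change the number of columns in test_vectors matches the training matrix, so predict no longer raises the error. How well the labels separate authors will still depend on the data and on parameters such as nu and gamma, which probably need tuning.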