
I am trying to train a model on a set of texts by one author and then use it to find texts by the same author in a test set. I am using scikit-learn's OneClassSVM. 'author_corpus' is a list of texts written by a single author, and 'test_corpus' is a list of texts written by both that author and other authors. I want OneClassSVM to identify which of the test texts were written by the original author.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

def analyse_corpus(author_corpus, test_corpus):

    vectorizer = TfidfVectorizer()

    author_vectors = vectorizer.fit_transform(author_corpus)
    test_vectors = vectorizer.fit_transform(test_corpus)

    model = OneClassSVM(gamma='auto')

    model.fit(author_vectors)

    test = model.predict(test_vectors)

I am getting the following ValueError:

X.shape[1] = 2484 should be equal to 1478, the number of features at training time

How might I implement this model to accurately predict authorship of texts in the test set given the single author in the train set? Any help is appreciated.

For reference, here is a link to the OneClassSVM documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM

MythKhan

1 Answer


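You should fit the vectorizer (and the model) on the training data only, then use the fitted objects to transform and predict on the test data. For the vectorizer:

  • fit: learns the vocabulary and IDF weights from the data
  • fit_transform: learns them and then returns the vectors for that same data
  • transform: converts new texts into vectors using the vocabulary that was already learned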

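The mistake is this line, which refits the vectorizer on the test corpus and so builds a different vocabulary (2484 features instead of the 1478 seen at training time):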

test_vectors = vectorizer.fit_transform(test_corpus)
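
To see why refitting changes the number of features, here is a minimal sketch using two tiny made-up corpora (the texts are illustrative assumptions, not data from the question). It compares the column count after refitting on the test texts with the column count after calling transform on the already fitted vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpora (hypothetical data, not from the question).
author_corpus = ["the cat sat on the mat", "the cat ate the fish"]
test_corpus = ["a dog chased the cat across the garden", "the fish swam away"]

vectorizer = TfidfVectorizer()
author_vectors = vectorizer.fit_transform(author_corpus)

# Refitting on the test corpus learns a new vocabulary,
# so the number of columns no longer matches the training vectors.
refit_vectors = TfidfVectorizer().fit_transform(test_corpus)

# transform() reuses the vocabulary learned from author_corpus,
# so the number of columns matches.
test_vectors = vectorizer.transform(test_corpus)

print("train features:", author_vectors.shape[1])
print("refit features:", refit_vectors.shape[1])
print("transform features:", test_vectors.shape[1])
```

Words in the test texts that the fitted vectorizer never saw are simply ignored by transform, which is why the shapes line up.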

Sample usage

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Training texts come from a single category; the test texts mix two categories.
train = fetch_20newsgroups(subset='train', categories=['alt.atheism'], shuffle=True, random_state=42).data
test = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'soc.religion.christian'], shuffle=True, random_state=42).data

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train)
test_vectors = vectorizer.transform(test)

model = OneClassSVM(gamma='auto')
model.fit(train_vectors)

# predict returns +1 for inliers and -1 for outliers.
test_predictions = model.predict(test_vectors)
mujjiga
  • Hi, thanks for the reply. It works this time but how would I go about printing the texts in the test set that are written by the same author? – MythKhan Feb 29 '20 at 06:50
  • I am also getting an issue where every prediction results in a -1 (an outlier) even if the text is very similar. What can I do to improve the accuracy? – MythKhan Feb 29 '20 at 07:20
  • Before converting to TF-IDF, clean up the text, remove stopwords, try stemming, and mean-normalize the TF-IDF vectors before training the model. – mujjiga Feb 29 '20 at 07:41
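
On the two follow-up questions in the comments, one possible approach is sketched below. OneClassSVM.predict returns +1 for inliers and -1 for outliers, so the texts predicted as the same author can be picked out by filtering on +1. The helper name same_author_texts, the sample texts, and the parameter values (stop_words="english", sublinear_tf=True, gamma="scale", nu=0.1) are assumptions chosen for illustration and are not tuned; whether they reduce the all -1 predictions depends on the data. Stemming is not shown, since TfidfVectorizer does not do it on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM


def same_author_texts(author_corpus, test_corpus, nu=0.1):
    """Return the test texts that the model labels as inliers (+1)."""
    # Stop-word removal and sublinear TF are illustrative choices, not tuned values.
    vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)
    author_vectors = vectorizer.fit_transform(author_corpus)
    test_vectors = vectorizer.transform(test_corpus)

    # nu is an upper bound on the fraction of training texts treated as outliers.
    model = OneClassSVM(gamma="scale", nu=nu)
    model.fit(author_vectors)

    predictions = model.predict(test_vectors)  # +1 = inlier, -1 = outlier
    return [text for text, label in zip(test_corpus, predictions) if label == 1]


# Hypothetical sample data, only to show the call.
author_corpus = [
    "the harbour was quiet and the boats rocked gently",
    "fishing boats returned to the quiet harbour at dusk",
    "gulls circled the harbour while the boats unloaded",
]
test_corpus = [
    "boats drifted in the quiet harbour",
    "the compiler reported a missing semicolon",
]

matches = same_author_texts(author_corpus, test_corpus)
for text in matches:
    print(text)
```

If every prediction is still -1, trying different values of nu and gamma (or a linear kernel) on a held-out set of known texts by the author is a reasonable next step.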