I'm currently training a LinearSVC classifier with a single feature vectorizer. I'm processing news articles, which are stored in separate files. Those files originally had a title, a textual body, a date, an author and sometimes an image, but I ended up removing everything except the textual body and using it as the only feature. I'm doing it this way:
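from sklearn import metrics, preprocessing
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Loading the files (plain files with just the news content; no date, author or other features)
data_train = load_files(self.TRAIN_FOLDER, encoding=self.ENCODING)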
data_test = load_files(self.TEST_FOLDER, encoding=self.ENCODING)
unlabeled = load_files(self.UNLABELED_FOLDER, encoding=self.ENCODING)
categories = data_train.target_names
# Get the target labels of each labeled dataset
y_train = data_train.target
y_test = data_test.target
# Vectorizing
vectorizer = TfidfVectorizer(encoding=self.ENCODING, use_idf=True, norm='l2', binary=False, sublinear_tf=True, min_df=0.001, max_df=1.0, ngram_range=(1, 2), analyzer='word')
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
X_unlabeled = vectorizer.transform(unlabeled.data)
# Instantiating the classifier
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)
# Fitting the model according to the training set and predicting
scaler = preprocessing.StandardScaler(with_mean=False)
scaler = scaler.fit(X_train)
normalized_X_train = scaler.transform(X_train)
clf.fit(normalized_X_train, y_train)
normalized_X_test = scaler.transform(X_test)
pred = clf.predict(normalized_X_test)
accuracy_score = metrics.accuracy_score(y_test, pred)
recall_score = metrics.recall_score(y_test, pred)
precision_score = metrics.precision_score(y_test, pred)
But now I would like to include other features, such as the date or the author, and all the simple examples I found use a single feature, so I'm not really sure how to proceed. Should I have all the information in a single file? How do I differentiate authors from content? Should I use a vectorizer for each feature? If so, should I fit a single model on the different vectorized features? Or should I have a different classifier for each feature? Can you suggest something to read (explained for newbies)?
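To make the "one vectorizer per feature" idea concrete, here is a minimal sketch of what I imagine it could look like. It assumes the news items are loaded into a pandas DataFrame with hypothetical `body` and `author` columns (the toy rows and labels are made up for illustration), and it uses scikit-learn's `ColumnTransformer` to apply a different transformer to each column before a single LinearSVC. I'm not sure this is the recommended approach, and I haven't included the date here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Hypothetical toy data: one row per news item, one column per feature.
news = pd.DataFrame({
    "body": [
        "the team won the match last night",
        "the striker scored twice in the final",
        "parliament passed the new budget law",
        "the minister announced a tax reform",
    ],
    "author": ["alice", "alice", "bob", "bob"],
})
labels = ["sports", "sports", "politics", "politics"]

# One transformer per column; their outputs are stacked side by side
# into a single feature matrix.
features = ColumnTransformer([
    # A single column name (string) passes 1-D text to the vectorizer.
    ("body", TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)), "body"),
    # A list of column names passes a 2-D frame to the encoder.
    ("author", OneHotEncoder(handle_unknown="ignore"), ["author"]),
])

# A single classifier trained on the combined features.
model = Pipeline([
    ("features", features),
    ("clf", LinearSVC(loss="squared_hinge", penalty="l2", dual=False, tol=1e-3)),
])
model.fit(news, labels)

unseen = pd.DataFrame({
    "body": ["the goalkeeper saved a penalty in the match"],
    "author": ["carol"],  # author not seen during training, encoded as all zeros
})
pred = model.predict(unseen)
print(pred)
```

If something like this is reasonable, I suppose the date could be added as another column with its own transformer, but I don't know whether that is better than training a separate classifier per feature.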
Thanks in advance,