Training a sklearn classifier with more than a single feature

Question

I'm currently training a LinearSVC classifier with a single feature vectorizer. I'm processing news articles, which are stored in separate files. Those files originally had a title, a textual body, a date, an author and sometimes an image, but I ended up removing everything except the textual body as a feature. I'm doing it this way:

# Loading the files (plain files with just the news content: no date, author or other features)

data_train = load_files(self.TRAIN_FOLDER, encoding=self.ENCODING)  # data_train
data_test = load_files(self.TEST_FOLDER, encoding=self.ENCODING)
unlabeled = load_files(self.UNLABELED_FOLDER, encoding=self.ENCODING)
categories = data_train.target_names

# Get the target labels of each dataset
y_train = data_train.target
y_test = data_test.target

# Vectorizing 
vectorizer = TfidfVectorizer(encoding=self.ENCODING, use_idf=True, norm='l2', binary=False, sublinear_tf=True, min_df=0.001, max_df=1.0, ngram_range=(1, 2), analyzer='word')

X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
X_unlabeled = vectorizer.transform(unlabeled.data)  # `unlabeled` is the Bunch loaded above

# Instantiating the classifier
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3)

# Fitting the model according to the training set and predicting
scaler = preprocessing.StandardScaler(with_mean=False) 
scaler = scaler.fit(X_train) 

normalized_X_train = scaler.transform(X_train) 
clf.fit(normalized_X_train, y_train) 

normalized_X_test = scaler.transform(X_test) 
pred = clf.predict(normalized_X_test)

accuracy_score = metrics.accuracy_score(y_test, pred)
recall_score = metrics.recall_score(y_test, pred)
precision_score = metrics.precision_score(y_test, pred)

But now I would like to include other features, such as the date or the author, and all the simple examples I found use a single feature, so I'm not sure how to proceed. Should I have all the information in a single file? How do I differentiate authors from content? Should I use a vectorizer for each feature? If so, should I fit one model on the different vectorized features, or should I have a different classifier for each feature? Can you suggest something to read (explained for newbies)?

Thanks in advance,

score 4 · Accepted Answer · edited Jun 05 '19 at 17:11


The output of TfidfVectorizer is a SciPy sparse matrix (scipy.sparse.csr_matrix). You may use scipy.sparse.hstack to append more features as extra columns (like here). Alternatively, you may convert the feature matrix you already have to a numpy array or pandas DataFrame and then add the new features (which you might have created with other vectorizers) as new columns; note that this makes the matrix dense, so it is only practical when the data fits in memory. Either way, your final X_train and X_test should contain all the features in one place. You may also need to standardize them before training (here). You do not seem to be doing that here.

I do not have your data so here is an example on some dummy data:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)

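# Densify and name the columns after the tokens, so all column names are strings
# (get_feature_names_out requires scikit-learn >= 1.0)
X_train = pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())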

X_train['has_image'] = [1, 0, 0, 1]  # just adding a dummy feature for demonstration
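
The hstack route mentioned above avoids densifying the TF-IDF matrix. Below is a minimal sketch of it, not a definitive implementation. The `authors` list, the `has_image` flags and the use of OneHotEncoder for the author column are assumptions made up for illustration; they are not part of the question's data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Hypothetical metadata, invented for this example (one entry per document)
authors = np.array(['alice', 'bob', 'alice', 'carol']).reshape(-1, 1)
has_image = csr_matrix(np.array([[1], [0], [0], [1]]))

# One vectorizer/encoder per feature, each fitted on the training data only
text_vectorizer = TfidfVectorizer()
X_text = text_vectorizer.fit_transform(corpus)

# handle_unknown='ignore' encodes authors unseen during fit as all zeros
author_encoder = OneHotEncoder(handle_unknown='ignore')
X_author = author_encoder.fit_transform(authors)

# Stack the blocks column-wise; the result stays sparse
X_train = hstack([X_text, X_author, has_image]).tocsr()

# New documents go through the already fitted transformers, in the same order
new_docs = ['Is this the third document?']
X_author_new = author_encoder.transform([['dave']])
X_new = hstack([
    text_vectorizer.transform(new_docs),
    X_author_new,
    csr_matrix(np.array([[0]])),
]).tocsr()
```

A single classifier can then be fitted on X_train; the column order must be the same for every dataset that is later passed to predict.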

answered Jun 05 '19 at 16:10 by Reveille · edited Jun 05 '19 at 17:11 by desertnaut

Thanks a lot for all the info and links!! – gal007 Jun 06 '19 at 08:09
Regarding normalization, should I add the following to my code? data = scale(X_train); clf.fit(data, y_train); pred = clf.predict(data) (I updated the code in the question since it is easier to read there). Thanks for pointing out this omission! – gal007 Jun 06 '19 at 09:13
That does not look right. The scaler should be fit on X_train only, and this fitted scaler is then applied to both X_train and X_test. I have explained that in another post [here](https://stackoverflow.com/questions/53127278/should-i-normalize-training-and-test-test-separately-after-shuffling-and-splitti/53136595#53136595). A demonstration of various scalers can also be found [here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py) – Reveille Jun 06 '19 at 12:39
Thanks a LOT! I updated the code in case it helps someone else. I ended up using a StandardScaler(with_mean=False), since the data is huge and otherwise it throws an error. I also standardized the unlabeled set, to later use it in the decision function. Thanks!! – gal007 Jun 06 '19 at 13:33
Awesome. my pleasure. – Reveille Jun 06 '19 at 14:12
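
The scaling advice from the comments above can be sketched as follows. This is a minimal example on made-up documents and labels (`train_docs`, `test_docs` and `y_train` are hypothetical), not the asker's actual pipeline. With sparse input, StandardScaler needs with_mean=False because centering would turn the matrix dense; scikit-learn raises an error otherwise, which is likely the error mentioned above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical toy data, invented for this example
train_docs = [
    'the team won the match',
    'the striker scored a late goal',
    'the parliament passed the budget',
    'the minister announced a new law',
]
y_train = [0, 0, 1, 1]  # 0 = sports, 1 = politics
test_docs = ['the team scored a goal', 'the parliament announced a law']

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Fit the scaler on the training set only, then reuse it everywhere else
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train_scaled, y_train)

pred = clf.predict(X_test_scaled)
# Signed distances to the separating hyperplane, one value per test document
scores = clf.decision_function(X_test_scaled)
```

Any other dataset, such as an unlabeled set, would go through the same fitted vectorizer and scaler before being passed to decision_function.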
