
I modified some Python code from GitHub to run logistic regression on a subset of consumer complaints data. The text vectorization and classification parts work smoothly, but I am wondering whether it is possible to also include non-text, binary numerical indicators, such as timely_response and consumer_disputed., as features alongside the text vectors. When I tried this, Python returned an error saying that I have input variables with inconsistent numbers of samples.

#%% load packages and data
import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
import re
import nltk
from nltk import word_tokenize
STOPWORDS = set(stopwords.words('english'))
from bs4 import BeautifulSoup
from IPython.core.interactiveshell import InteractiveShell

df = pd.read_csv('https://www.dropbox.com/s/obbs000w7knjmys/example_complaints.csv?dl=1')
df = df[pd.notnull(df['consumer_complaint_narrative'])]

df['product'].value_counts()

#%% cleaning text

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text
    
df['consumer_complaint_narrative'] = df['consumer_complaint_narrative'].apply(clean_text)

#%% include only text as features

X = df['consumer_complaint_narrative']
y = df['product']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 42)

#%% fit and test with logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5)),
                  ])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
my_tags = sorted(['Debt collection','Mortgage','Credit reporting','Credit card','Bank account or service','Consumer Loan','Student loan','Payday loan','Money transfers','Other financial service','Prepaid card'])  # classification_report lists classes in sorted order, so target_names must be sorted to match

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))


#%% including binary numerical indicators as additional features

new_X = df[['consumer_complaint_narrative', 'timely_response', 'consumer_disputed.']]
y = df['product']
X_train, X_test, y_train, y_test = train_test_split(new_X, y, test_size=0.25, random_state = 42)

#%% fit and test again

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5)),
                  ])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))

which returns the following error message:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-28-455c8fd83ba4> in <module>
          8                 ('clf', LogisticRegression(n_jobs=1, C=1e5)),
          9                ])
    ---> 10 logreg.fit(X_train, y_train)
         11 
         12 y_pred = logreg.predict(X_test)

    ~\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
        265         Xt, fit_params = self._fit(X, y, **fit_params)
        266         if self._final_estimator is not None:
    --> 267             self._final_estimator.fit(Xt, y, **fit_params)
        268         return self
        269 

    ~\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
       1286 
       1287         X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order="C",
    -> 1288                          accept_large_sparse=solver != 'liblinear')
       1289         check_classification_targets(y)
       1290         self.classes_ = np.unique(y)

    ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
        764         y = y.astype(np.float64)
        765 
    --> 766     check_consistent_length(X, y)
        767 
        768     return X, y

    ~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
        233     if len(uniques) > 1:
        234         raise ValueError("Found input variables with inconsistent numbers of"
    --> 235                          " samples: %r" % [int(l) for l in lengths])
        236 
        237 

    ValueError: Found input variables with inconsistent numbers of samples: [3, 529]

I would be really grateful if someone could shed some light on this.

Chris T.

1 Answer


This has to do with fitting a pipeline with multiple steps.

CountVectorizer() and TfidfTransformer() are meant for text data: they expect an iterable of documents, but the additional fields you are adding are not text. When you pass a three-column DataFrame, CountVectorizer iterates over it and sees only the three column names, so the vectorizer produces 3 samples while y_train has 529 labels; that is the [3, 529] mismatch in your error. (This is the part I am talking about below.)


logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5)),
                  ])
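
A quick way to confirm this (a toy check, using the new_X from the question): iterating over a pandas DataFrame yields its column names, and that is all CountVectorizer gets to see here.

for doc in new_X:
    print(doc)
## prints: consumer_complaint_narrative, timely_response, consumer_disputed.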

There are some other threads about adding extra pandas columns to text data, however that approach seems somewhat cumbersome to me (see Adding pandas columns to a sparse matrix).
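If you do want to go that route, here is a minimal sketch of how it could look with scikit-learn's ColumnTransformer (available in scikit-learn >= 0.20; this is not part of the original post), assuming the two indicator columns are already numeric 0/1:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

## vectorize the narrative column, pass the binary columns through untouched
preprocess = ColumnTransformer(
    [('text', TfidfVectorizer(), 'consumer_complaint_narrative')],
    remainder='passthrough')

combined = Pipeline([('prep', preprocess),
                     ('clf', LogisticRegression(n_jobs=1, C=1e5))])

## X_train/X_test here are the three-column new_X splits from the question
combined.fit(X_train, y_train)
y_pred = combined.predict(X_test)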

Another option for integrating the additional data is to fit your logistic regression on the text alone, then feed the predicted probabilities from that model, together with your additional features, into a second model.

Here is one way you could do that:

X = df['consumer_complaint_narrative']
y = df['product']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 42)

#%% fit and test with logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import classification_report

logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5)),
                  ])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
## get the predicted probabilities from the text model to use as features in the second model
add_features = pd.DataFrame(logreg.predict_proba(X))

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))


#%% including binary numerical indicators as additional features

new_X = pd.concat([add_features, df[['timely_response', 'consumer_disputed.']]], axis=1)
y = df['product']
X_train, X_test, y_train, y_test = train_test_split(new_X, y, test_size=0.25, random_state = 42)

#%% fit and test again
## no pipeline needed here; we can fit a single logistic regression
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))

I tried to think whether there could be some sort of over-fitting issue here, but I do not think so: since you are using the same random_state, your split is exactly the same as before, so we should be good.

After reviewing your comment, I changed my code to turn add_features into a DataFrame so it can be concatenated properly.

I think the other issue you are experiencing is that the indexes in your dataframes are inconsistent.

An easy way to fix this is to reset the index at the very beginning of your script.

Look at the first two lines, where you load your dataframe:

df = pd.read_csv('https://www.dropbox.com/s/obbs000w7knjmys/example_complaints.csv?dl=1')
## add reset_index here
df = df[pd.notnull(df['consumer_complaint_narrative'])].reset_index()
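
To see why the index matters (a toy illustration, not from the original data): pd.concat with axis=1 aligns rows on their index labels, so two frames whose indexes do not overlap get "stacked" with NaN padding instead of joined side by side.

import pandas as pd

## rows that survive the notnull filter keep their original labels ...
filtered = pd.DataFrame({'a': [1, 2, 3]}, index=[5, 7, 9])
## ... while a freshly built DataFrame gets the default 0..n-1 index
probs = pd.DataFrame({'p': [0.1, 0.2, 0.3]})

print(pd.concat([filtered, probs], axis=1))
## -> six rows, with NaN wherever a label exists in only one frame

print(pd.concat([filtered.reset_index(drop=True), probs], axis=1))
## -> three aligned rows, as intended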

jawsem
  • Hi, thanks for your reply. I got an error message saying that `TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid`; obviously this has to do with the `add_features` object. – Chris T. Feb 06 '20 at 05:01
  • Yet after I transformed `add_features` into a `DataFrame` object using `pd.DataFrame()`, added it back into `pd.concat()`, and replicated your code, Python returns the same error message: `Found input variables with inconsistent numbers of samples: [1318, 706]` – Chris T. Feb 06 '20 at 05:21
  • @ChrisT. I have made some edits. The reason for the second error is inconsistent indexes between add_features and your other data. Resetting the index in the beginning should fix it. – jawsem Feb 06 '20 at 13:11
  • thanks for the heads up. I tend to think it's because the `pd.concat()` function is not working as expected: instead of concatenating the two dataframes "side by side," it "stacks" NaN rows on the bottom of the expanded new dataframe, causing the row length of `X` to be larger than `y`, hence the error. – Chris T. Feb 06 '20 at 13:14
  • @ChrisT. You are right that the issue is with pd.concat. The way pd.concat works is that it concatenates values on their index, so it's not really anything wrong with pd.concat; it is doing what it is supposed to do. The issue is that when you make the prediction with predict_proba, the result no longer has the original index from the dataframe. The issue could also be solved by doing the following when we get our add_features: ```add_features = pd.DataFrame(logreg.predict_proba(X), index=X.index)``` – jawsem Feb 06 '20 at 16:14