I'm building a supervised learning application using scikit-learn. My input data comes from a table. The text document is essentially one column ('description') of this table, but I think I can improve my accuracy by referencing other columns during vectorization (for example, to remove street addresses from the description field).

I think I can do this by specifying my own preprocessor and tokenizer functions when I construct the vectorizer. However, I run into trouble with the vectorizer's fit() method. I'm trying to pass a DataFrame containing my columns as the raw_documents.

When the raw_documents reach the CountVectorizer._count_vocab() method to build the vocabulary, the code iterates through each record using "for doc in raw_documents:". I was expecting this to walk through each row of the DataFrame and provide a pandas Series containing that record as the "doc". This "doc" would then be passed to the analyzer, and on to my preprocessor and tokenizer, where I could reference the associated fields in the Series by name.

Unfortunately, the default behavior of DataFrame is that __iter__() walks the information axis (the column labels) instead of the index axis. This means my vectorizer iterates over the list of column headings instead of the record rows, and the "doc" that reaches the analyzer is just a column-heading string.
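A tiny self-contained demonstration of that behavior (using made-up values matching the example below):

import pandas as pd

df = pd.DataFrame({'street': ['123 Elm Street'],
                   'description': ['Pine Point was situated at 123 Elm Street in Boston.']})
for doc in df:
    print(doc)  # prints 'street', then 'description': the column labels, not the rows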

This simple example shows what I am trying to do. The preprocessor is dumb and incomplete, but it shows how I am trying to access the adjacent field on a record. (I could also use some direction on how to properly update the description value on the input so as to avoid the pandas SettingWithCopyWarning. I tried to follow the recommendation to use .loc[], but I still get the warning.)

import re
from io import StringIO
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def my_preprocessor(record):
    try:
        description = record.loc['description']
        # try to update the description field on this record by removing the
        # street address (re.escape treats the address as a literal rather
        # than a regex, and the pattern is lowercased to match the
        # lowercased description)
        description = re.sub(re.escape(record['street']).lower(), '', description.lower())
        # need help here with SettingWithCopyWarning
        record.loc['description'] = description
        return record
    except AttributeError:
        # record is a plain string (e.g. a column label), not a Series
        return record.lower()

data = StringIO('''"id","street","description","label_1","label_2"
"2341324","123 Elm Street","Pine Point was situated at 123 Elm Street in Boston.",1,1''')

df = pd.read_csv(data)

vect = CountVectorizer(preprocessor=my_preprocessor)
vect.fit_transform(df)
print(vect.vocabulary_)

This results in the column headings as my vocabulary:

{'id': 1, 'street': 4, 'description': 0, 'label_1': 2, 'label_2': 3}

I looked at several options:

  • Wrap my input data in a DataFrame subclass (RowIterableDataFrame) that overrides __iter__() with a row-wise iterator implementation (sketched at the end of this question). I can make the iterator work, but scikit-learn's GridSearchCV does a bunch of slicing of the input data, so by the time it gets into the _count_vocab() method with the "for doc in raw_documents:" loop, the RowIterableDataFrame I passed in has been sliced back down to a subset of data rows as a regular DataFrame.
  • Pass in the records using DataFrame's iterrows() or itertuples() methods. This gets the right data in on a row-by-row basis, but fails the check_consistent_length() test when fit() calls indexable().
  • Subclass CountVectorizer and write my own version of the _count_vocab() method that iterates through raw_documents differently in the case of DataFrames (i.e. using .iloc[] indexing). I'd rather not do this because _count_vocab() does a bunch of other stuff I don't want to risk breaking.
  • Pre-process my records outside of scikit-learn to build a delimited string as input, pass a list of these strings in as the raw_documents, and then parse them in my preprocessor. This means extra passes through the data.
  • Pass in the records using DataFrame.to_dict(orient='records') (a sketch follows this list). This gets me the right data on a row-by-row basis and keeps the column names for referencing by my preprocessor and tokenizer. The downside appears to be that I have to copy the data for each row out into that dictionary instead of referencing the original data in the DataFrame as a Series.
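For example, that last option would look something like this (a minimal sketch; my_dict_preprocessor is a hypothetical adaptation of the preprocessor above to plain dicts):

def my_dict_preprocessor(record):
    if isinstance(record, dict):
        # fields are reachable by column name, but the values have already
        # been copied out of the DataFrame by to_dict()
        pattern = re.escape(record['street'].lower())
        return re.sub(pattern, '', record['description'].lower())
    return record.lower()

vect = CountVectorizer(preprocessor=my_dict_preprocessor)
vect.fit_transform(df.to_dict(orient='records'))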

I would welcome some guidance on how to do this, perhaps on changing the iteration behavior of a pandas DataFrame, or on the simplest approach to extending CountVectorizer.
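For concreteness, a minimal sketch of the first option (the _constructor property, per the pandas subclassing caveats, is what keeps slices as the subclass; see the comments below):

class RowIterableDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # slices and copies (e.g. the ones GridSearchCV takes) come back as
        # RowIterableDataFrame instead of reverting to a plain DataFrame
        return RowIterableDataFrame

    def __iter__(self):
        # walk the index axis, yielding each record as a Series
        for idx in self.index:
            yield self.loc[idx]

for doc in RowIterableDataFrame(df):
    print(type(doc))  # each doc is now a row Series, not a column label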

Dig_Doug
  • Can you make a simple but reproducible example of a data set and provide the desired output? Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Feb 21 '17 at 21:32
  • I have a workaround using the first approach above (subclassing DataFrame as RowIterableDataFrame and overriding __iter__() with my DataFrameRowIterator). The issue I ran into the first time was that I didn't understand the caveats about specifying the _constructor property when subclassing DataFrame. Now a slice of a RowIterableDataFrame continues as a RowIterableDataFrame. So that works, but I would still welcome guidance on a better way to approach this whole scenario. – Dig_Doug Feb 22 '17 at 18:57
  • Also realized that the SettingWithCopyWarning was probably pointing me in the right direction. The preprocessor now makes a copy of the input Series and updates the copy. I don't want to change the value in the source DataFrame, just in the data getting passed along to the tokenizer and everything downstream for this run. The copy gets thrown away once the tokenizer and analyzer complete their work (a sketch of that version follows these comments). – Dig_Doug Feb 22 '17 at 19:29
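A minimal sketch of that copy-based preprocessor (same shape as the example in the question; a custom tokenizer downstream is expected to pull fields off the returned Series):

def my_preprocessor(record):
    try:
        # work on a copy so the row in the source DataFrame is untouched;
        # assigning to the copy also avoids the SettingWithCopyWarning
        record = record.copy()
        description = record.loc['description'].lower()
        record.loc['description'] = re.sub(
            re.escape(record['street']).lower(), '', description)
        return record
    except AttributeError:
        # record was a plain string, not a Series
        return record.lower()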

0 Answers