I have a DataFrame like:

         text_data           worker_dicts                                outcomes

    0    "Some string"       {"Sector": "Finance", "State": "NJ"}        0
    1    "Another string"    {"Sector": "Programming", "State": "NY"}    1

It has both a text column and a column of dictionaries (the real worker_dicts has many more fields). I want to predict the binary outcomes column.

What I initially tried was to combine text_data and worker_dicts by crudely concatenating the two columns, and then run Multinomial NB on the result:

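    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Cast both columns to str so the dicts can be concatenated with the text
    df['stacked_features'] = df['text_data'].astype(str) + '_' + df['worker_dicts'].astype(str)
    stacked_features = np.array(df['stacked_features'])
    outcomes = np.array(df['outcomes'])
    text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
                         ('clf', MultinomialNB())])
    text_clf = text_clf.fit(stacked_features, outcomes)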

But I got very bad accuracy, and I think that fitting two independent models would be a better use of data than fitting one model on both types of features (as I am doing with stacking).

How would I go about using FeatureUnion for this? worker_dicts is awkward because each entry is a dictionary, and I'm not sure how to parse it.

Grr

1 Answer

If your dictionary entries are categorical as they appear to be in your example, then I would create different columns from the dictionary entries before doing additional processing.

    new_features = pd.DataFrame(df['worker_dicts'].values.tolist())

new_features will then be its own DataFrame with columns Sector and State, which you can one-hot encode as needed, alongside TF-IDF or other feature extraction for your text_data column. To do the dictionary parsing inside a pipeline you would need to write a new transformer class, so I would suggest applying the dictionary parsing and the TF-IDF separately, stacking the results, and then adding OneHotEncoder to your pipeline, since it lets you specify which columns the transformer applies to. (Because the categories you want to encode are strings, you may want to use the LabelBinarizer class instead of the OneHotEncoder class for the encoding.)
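As a rough sketch of that parsing step, assuming each worker_dicts entry is an actual Python dict (not a string) and that all entries share the same keys. The toy data is made up, and pd.get_dummies is just one convenient way to do the encoding outside a pipeline:

```python
import pandas as pd

# Toy frame mirroring the layout in the question (values are illustrative)
df = pd.DataFrame({
    "text_data": ["Some string", "Another string"],
    "worker_dicts": [
        {"Sector": "Finance", "State": "NJ"},
        {"Sector": "Programming", "State": "NY"},
    ],
    "outcomes": [0, 1],
})

# Expand each dict into its own columns, keeping the original row index
new_features = pd.DataFrame(df["worker_dicts"].tolist(), index=df.index)

# One-hot encode the categorical columns; each category becomes its own column
encoded = pd.get_dummies(new_features, columns=["Sector", "State"])

print(encoded.columns.tolist())
```

Passing index=df.index keeps the new columns aligned with the original rows, which matters if df has been filtered or shuffled.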

If you just want to apply TF-IDF to each column individually within a pipeline, you would need a nested Pipeline and FeatureUnion setup that extracts the columns, as described here.
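A minimal sketch of that kind of nested setup, assuming the dictionary has already been expanded into Sector and State columns. The helper names column_picker and column_tfidf are made up for this example, and the toy data is illustrative:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def column_picker(column):
    """Hypothetical helper: transformer that extracts one column as strings."""
    return FunctionTransformer(lambda X: X[column].astype(str), validate=False)


def column_tfidf(column):
    """Hypothetical helper: select one column, then run TF-IDF on it."""
    return Pipeline([
        ("select", column_picker(column)),
        ("tfidf", TfidfVectorizer()),
    ])


# Toy data with the dict fields already split into columns
X = pd.DataFrame({
    "text_data": ["Some string", "Another string"],
    "Sector": ["Finance", "Programming"],
    "State": ["NJ", "NY"],
})

# One TF-IDF sub-pipeline per column; FeatureUnion stacks their outputs side by side
union = FeatureUnion([(name, column_tfidf(name)) for name in X.columns])
features = union.fit_transform(X)

print(features.shape)
```

Each column gets its own vocabulary this way, so a token that appears in both text_data and Sector is counted as two separate features.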

If you have your one-hot encoded features in DataFrames X1 and X2, as described in the comments below, and your text features in X3, you could do something like the following to create a pipeline. (There are many other options; this is just one way.)

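    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import FunctionTransformer

    # assumes X3 holds the raw text in a column named 'text_data'
    X = pd.concat([X1, X2, X3], axis=1)

    def select_text_data(X):
        return X['text_data']

    def select_remaining_data(X):
        return X.drop('text_data', axis=1)

    # pipeline that selects the text column and computes TF-IDF features for it
    text_pipeline = Pipeline([
        ('column_selection', FunctionTransformer(select_text_data, validate=False)),
        ('tfidf', TfidfVectorizer())
    ])

    # combine the text features with the remaining (already encoded) columns, then classify
    final_pipeline = Pipeline([
        ('feature-union', FeatureUnion([
            ('text-features', text_pipeline),
            ('other-features', FunctionTransformer(select_remaining_data, validate=False))
        ])),
        ('clf', LogisticRegression())
    ])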

(MultinomialNB cannot be used as an intermediate step in the pipeline because it has no transform method. As the final estimator it only needs fit, so it could replace LogisticRegression here.)
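For a quick end-to-end check, here is a sketch of the same idea on toy data. The rows and the use of pd.get_dummies are assumptions for illustration, and the non-text selector returns a NumPy array rather than a DataFrame to keep the stacking step simple:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data (illustrative only), with the dict fields already split into columns
df = pd.DataFrame({
    "text_data": ["Some string", "Another string", "More text here", "Final example text"],
    "Sector": ["Finance", "Programming", "Finance", "Programming"],
    "State": ["NJ", "NY", "NY", "NJ"],
    "outcomes": [0, 1, 0, 1],
})

# Raw text plus one-hot encoded categorical columns in a single frame
X = pd.concat(
    [df[["text_data"]], pd.get_dummies(df[["Sector", "State"]], dtype=float)],
    axis=1,
)
y = df["outcomes"]


def select_text_data(X):
    return X["text_data"]


def select_remaining_data(X):
    # Everything except the text column, as a float array
    return X.drop("text_data", axis=1).to_numpy(dtype=float)


pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", Pipeline([
            ("select", FunctionTransformer(select_text_data, validate=False)),
            ("tfidf", TfidfVectorizer()),
        ])),
        ("other", FunctionTransformer(select_remaining_data, validate=False)),
    ])),
    ("clf", LogisticRegression()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

With only four rows this says nothing about accuracy; it just confirms that the text and one-hot features flow through one pipeline.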

elz
  • This is a very informative answer! In the meantime, I ended up using Pandas to do some stuff: First, I turned each of the `worker_dicts` keys (`Sector`, `State`) into columns, and then I used one-hot encoding on each of those columns. Then, for both columns, I stacked all the sparse matrices that were outputs of one-hot encoding. Let `X1` and `X2` denote the stacked sparse arrays comprising the one-hot encoding of `Sector` and `State`, respectively. Let `X3` denote the vectorized `text_data` column. –  Dec 20 '17 at 22:46
  • Given the modifications to the data - where we now have `X1`, `X2`, `X3` - would you make any edits to the two methods you suggested? –  Dec 20 '17 at 23:03