
I am trying to prepare data for supervised learning. I have my TF-IDF data, which was generated from the "kws_name_desc" column of my dataframe merged:

vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1,2))
X = vect.fit_transform(merged['kws_name_desc'])
print(X.shape)
print(type(X))

(57629, 11947)
<class 'scipy.sparse.csr.csr_matrix'>

But I also need to add additional columns to this matrix. For each document in the TF-IDF matrix, I have a list of additional numeric features. Each list has length 40 and consists of floats.

So to clarify, I have 57,629 lists of length 40 which I'd like to append to my TF-IDF result.

Currently, I have these in a DataFrame column, merged["other_data"]. Below is an example row from merged["other_data"]:

0.4329597715,0.3637511039,0.4893141843,0.35840...   

How can I append the 57,629 rows of my dataframe column with the TF-IDF matrix? I honestly don't know where to begin and would appreciate any pointers/guidance.

jrjames83
  • Does this answer your question? [use Featureunion in scikit-learn to combine two pandas columns for tfidf](https://stackoverflow.com/questions/34710281/use-featureunion-in-scikit-learn-to-combine-two-pandas-columns-for-tfidf) – louis_guitton Apr 18 '20 at 21:53

3 Answers


This will do the work:

df1 = pd.DataFrame(X.toarray())   # convert the sparse matrix to a dense array
df2 = YOUR_DF                     # your 57k x 40 dataframe of extra features

newDf = pd.concat([df1, df2], axis=1)   # newDf is the required dataframe
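A minimal runnable sketch of this approach, using small random stand-ins for the real 57,629 x 11,947 TF-IDF matrix and the 57,629 x 40 extra features (all names and sizes here are illustrative, not from the question's data):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# stand-ins: 5 documents, 10 TF-IDF features, 40 extra floats per document
X = csr_matrix(np.random.rand(5, 10))
df2 = pd.DataFrame(np.random.rand(5, 40))

df1 = pd.DataFrame(X.toarray())        # densify -- fine at toy scale, but
                                       # costly for a 57629 x 11947 matrix
newDf = pd.concat([df1, df2], axis=1)
print(newDf.shape)                     # (5, 50)
```

Note that `toarray()` materializes the full dense matrix, so at the question's scale the sparse `hstack` in the accepted answer is much lighter on memory.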
eshb

I figured it out:

First: iterate over my pandas column and create a list of lists

for_np = []

for x in merged['other_data']:
    row = x.split(",")
    row2 = list(map(float, row))   # wrap in list() so this also works on Python 3
    for_np.append(row2)

Then create a np array:

n = np.array(for_np)

Then use scipy.sparse.hstack on X (my original TF-IDF sparse matrix) and my new matrix. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!

import scipy.sparse

X = scipy.sparse.hstack([X, n])
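The steps above, condensed into a self-contained sketch (the toy matrices and the `other_data` strings are stand-ins, not the question's real data):

```python
import numpy as np
import scipy.sparse
from scipy.sparse import csr_matrix

# stand-ins: 3 documents, 5 TF-IDF features, 4 extra floats each
X = csr_matrix(np.arange(15, dtype=float).reshape(3, 5))
other_data = ["0.1,0.2,0.3,0.4", "0.5,0.6,0.7,0.8", "0.9,1.0,1.1,1.2"]

# parse the comma-separated floats into a list of lists, then an array
for_np = [list(map(float, row.split(","))) for row in other_data]
n = np.array(for_np)

# hstack takes a single sequence of blocks -- note the [ ] around X and n
X_combined = scipy.sparse.hstack([X, n])
print(X_combined.shape)   # (3, 9)
```

The result of `scipy.sparse.hstack` stays sparse, so the dense 40-column block is simply appended without materializing the full TF-IDF matrix.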
jrjames83
  • I am sure I looked around and overlooked what I was missing when trying to add a column. Somebody on another question made it clear, but it just clicked with this one line above. –  Jun 28 '18 at 15:47
  • Oops, hit return: hstack(X_train_tfidf, X_shp) did not work but hstack([X_train_tfidf, X_shp]) did work, and the difference is the [ ]. –  Jun 28 '18 at 15:48
  • That was really an interesting question and solution. Can you add some idea on if you scaled the extra columns or used them as it is? – lu5er Jan 22 '19 at 13:38
  • @lu5er - if I recall correctly, I experimented with various weightings for the tfidf features, but they did not improve my results much, so I dropped them. I could have perhaps appended a PCA'd down version of the results but the outcome likely would have been the same. I think combining NLP style features with more generic features is still a pretty open ended issue/problem. In a more recent problem, I've created binary features, based on whether or not a training observation contains a word, or contains one of many words, thus avoiding tons of new features. – jrjames83 Jan 22 '19 at 15:27

You could have a look at the answer to this question:

use Featureunion in scikit-learn to combine two pandas columns for tfidf

Obviously, the answers given should work, but as soon as you want your classifier to make predictions, you definitely want to work with pipelines and feature unions.
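As one way this pattern can look in modern scikit-learn, here is a hedged sketch using ColumnTransformer, which covers the same use case as the FeatureUnion approach in the linked answer. The dataframe, its column names, and the values are invented for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# toy dataframe: one text column plus two numeric feature columns
df = pd.DataFrame({
    "kws_name_desc": ["red apple pie", "green apple tart", "blue berry pie"],
    "feat_a": [0.1, 0.2, 0.3],
    "feat_b": [1.0, 2.0, 3.0],
})

ct = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "kws_name_desc"),  # text column -> TF-IDF features
    ("nums", "passthrough", ["feat_a", "feat_b"]),  # numeric columns appended as-is
])

X = ct.fit_transform(df)
print(X.shape)   # (3, number of TF-IDF terms + 2)
```

Because the transformer is fitted as one object, the same vocabulary and column handling are applied at prediction time, which is exactly the pitfall pipelines are meant to avoid.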

thomi