
I have the following pandas structure:

col1 col2 col3 text
1    1    0    meaningful text
5    9    7    trees
7    8    2    text

I'd like to vectorise it using a tfidf vectoriser. This, however, returns a sparse matrix, which I can turn into a dense matrix via mysparsematrix.toarray(). However, how can I add this info, with labels, to my original df? So the target would look like:

col1 col2 col3 meaningful text trees
1    1    0    1          1    0
5    9    7    0          0    1
7    8    2    0          1    0

UPDATE:

The solution produces a wrong concatenation even when I rename the original columns. Dropping columns with at least one NaN leaves only 7 rows, even though I use fillna(0) before starting to work with the data.

lte__

3 Answers


You can proceed as follows:

Load data into a dataframe:

import pandas as pd

df = pd.read_table("/tmp/test.csv", sep=r"\s+")
print(df)

Output:

   col1  col2  col3             text
0     1     1     0  meaningful text
1     5     9     7            trees
2     7     8     2             text

Tokenize the text column using sklearn.feature_extraction.text.TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
x = v.fit_transform(df['text'])

Convert the tokenized data into a dataframe:

df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(df1)

Output:

   meaningful      text  trees
0    0.795961  0.605349    0.0
1    0.000000  0.000000    1.0
2    0.000000  1.000000    0.0

Concatenate the tokenization dataframe to the original one:

res = pd.concat([df, df1], axis=1)
print(res)

Output:

   col1  col2  col3             text  meaningful      text  trees
0     1     1     0  meaningful text    0.795961  0.605349    0.0
1     5     9     7            trees    0.000000  0.000000    1.0
2     7     8     2             text    0.000000  1.000000    0.0

If you want to drop the column text, you need to do that before the concatenation:

df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
print(res)

Output:

   col1  col2  col3  meaningful      text  trees
0     1     1     0    0.795961  0.605349    0.0
1     5     9     7    0.000000  0.000000    1.0
2     7     8     2    0.000000  1.000000    0.0

Here's the full code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_table("/tmp/test.csv", sep=r"\s+")
v = TfidfVectorizer()
x = v.fit_transform(df['text'])

df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
df.drop('text', axis=1, inplace=True)
res = pd.concat([df, df1], axis=1)
Mohamed Ali JAMAOUI
    This almost works, but something goes wrong... At default, this performs an outer join, and I end up with 699 rows instead of the original 353, with a lot of NaN rows... What might be wrong? – lte__ Aug 30 '17 at 14:11
  • @lte__ can you share a dataset I can use to reproduce the problem ? – Mohamed Ali JAMAOUI Aug 30 '17 at 14:15
  • no, it's confidential data... I think some of the words in the text are the same as the labels, and this results in the outer join behaviour (just like the first example here https://pandas.pydata.org/pandas-docs/stable/merging.html#set-logic-on-the-other-axes ) – lte__ Aug 30 '17 at 14:20
  • @lte__ I suggest that you add a prefix to all the columns names in the original data then do the transformation. (something_col1, something_col2, ..) – Mohamed Ali JAMAOUI Aug 30 '17 at 14:25
  • Hm it's still the same for some reason... I'll update the question with a screenshot. – lte__ Aug 30 '17 at 14:32
  • @lte__ what was the issue? – Mohamed Ali JAMAOUI Aug 30 '17 at 16:34
  • I don't know, we solved it with a workaround on another question. But for this, your solution is actually the right answer. – lte__ Aug 30 '17 at 16:35
  • @MohamedAliJAMAOUI is there any way to convert to a DataFrame other than using `.toarray()`? It returns a `MemoryError`. I'd appreciate it if you've got another idea to solve it. Thank you. –  Nov 06 '20 at 07:13

You can try the following:

import numpy as np 
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer

# create some data
col1 = np.asarray(np.random.choice(10,size=(10)))
col2 = np.asarray(np.random.choice(10,size=(10)))
col3 = np.asarray(np.random.choice(10,size=(10)))
text = ['Some models allow for specialized',
         'efficient parameter search strategies,',
         'outlined below. Two generic approaches',
         'to sampling search candidates are ',
         'provided in scikit-learn: for given values,',
         'GridSearchCV exhaustively considers all',
         'parameter combinations, while RandomizedSearchCV',
         'can sample a given number of candidates',
         ' from a parameter space with a specified distribution.',
         ' After describing these tools we detail best practice applicable to both approaches.']

# create a dataframe from the created data
df = pd.DataFrame([col1,col2,col3,text]).T
# set column names
df.columns=['col1','col2','col3','text']

tfidf_vec = TfidfVectorizer()
tfidf_dense = tfidf_vec.fit_transform(df['text']).todense()
new_cols = tfidf_vec.get_feature_names_out()

# remove the text column as the word 'text' may exist in the words and you'll get an error
df = df.drop('text',axis=1)
# join the tfidf values to the existing dataframe
df = df.join(pd.DataFrame(tfidf_dense, columns=new_cols))
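As an alternative to dropping the text column, the original column names can be prefixed so they cannot collide with token columns (the approach suggested in the comments on the accepted answer). A hedged sketch with made-up data, where `orig_` is an arbitrary prefix:

```python
import pandas as pd

# Original frame whose 'text' column name collides with a token column
df = pd.DataFrame({'col1': [1, 5, 7], 'text': ['a b', 'b', 'a']})
# Stand-in for the tokenized frame, which also contains a 'text' column
tokens = pd.DataFrame({'a': [1, 0, 1], 'b': [1, 1, 0], 'text': [0, 0, 0]})

# add_prefix renames col1 -> orig_col1, text -> orig_text before joining
res = df.add_prefix('orig_').join(tokens)
print(res.columns.tolist())
```

After the prefix, every column name is unique, so the join cannot raise a duplicate-column error.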
Clock Slave

I would like to add some information to the accepted answer.

Before concatenating the two DataFrames (i.e. the main DataFrame and the TF-IDF DataFrame), make sure that their indices are aligned. For instance, you can use df.reset_index(drop=True, inplace=True) to reset the DataFrame index.

Otherwise, your concatenated DataFrames will contain a lot of NaN rows. Having looked at the comments, this is probably what the OP experienced.
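To make the failure mode concrete, a small sketch with hypothetical data showing how mismatched indices inflate the row count, and how resetting the index fixes it:

```python
import pandas as pd

# Main frame with a non-default index, e.g. left over after filtering rows
df = pd.DataFrame({'col1': [1, 5, 7]}, index=[10, 11, 12])
# TF-IDF frame built fresh, so it carries the default 0..2 index
tfidf_df = pd.DataFrame({'trees': [0.0, 1.0, 0.0]})

# The indices don't overlap, so concat aligns nothing: 6 rows full of NaN
bad = pd.concat([df, tfidf_df], axis=1)

# Resetting the index first yields the expected 3 fully aligned rows
good = pd.concat([df.reset_index(drop=True), tfidf_df], axis=1)
```

`pd.concat(..., axis=1)` aligns on the index labels, not on row position, which is why the mismatch silently doubles the row count.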

Glorian
  • I had the same problem as lte_ that there were more rows than expected with lots of NA values. The problem was in the index. Thx Glorian and all others. – seakyourpeak Feb 17 '22 at 17:21