
To give you some context, here is a copy of my previous code:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# df is the first dataframe shown below; its 'text' column holds the messages
sms = df['text'].values.tolist()
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(sms)
tv_matrix = tv_matrix.toarray()
# On scikit-learn < 1.0 use tv.get_feature_names() instead
vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
# Pairwise similarity of every row against every other row (n x n)
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df.columns = similarity_df.columns.map(str)
similarity_df

The output is

           0           1           2           3    
0   1.000000    0.000000    0.038781    0.108865    
1   0.000000    1.000000    0.018147    0.000000    
2   0.038781    0.018147    1.000000    0.038326    
3   0.108865    0.000000    0.038326    1.000000

I want to switch to two dataframes. The first one is

id  text
0   "Daei rumah Indri jam berpa?Nyasar gak de,hhehhee\nSkrang sama sapa k'bogor?Orang rumah apa temen SMA "
1   'Mas dmn .. Ak udh smpe kantor yah.. Mas udh smpe blm??'
2   'Biarin .. Km ga kenal cowonya mas.. Hehe \nKm di cikeas dari jam brp?? Kok ga sms .. Knp baru sms. Wkwkw'
3   'Wkkwkkwkk.....Asem di tanya bilang kepo\nIya sapa Ade sayang,\nMasih di cekias de sama ibu,'

and the second dataframe

df2

Id  text
A   udh smpe kantor
B   ga kenal cowonya mas

How should I compute the cosine similarity between these two dataframes?
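For illustration, here is a minimal sketch of one way this could be set up, not a definitive answer. It assumes both dataframes have a `text` column, that fitting one `TfidfVectorizer` on both corpora is acceptable (so both matrices share a vocabulary and IDF weights), and it uses shortened toy texts. The names `X1`, `X2`, `cross_sim` and `cross_df` are made up for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the two dataframes in the question (texts shortened).
df = pd.DataFrame({
    "text": [
        "Daei rumah Indri jam berpa",
        "Mas dmn .. Ak udh smpe kantor yah.. Mas udh smpe blm??",
        "Biarin .. Km ga kenal cowonya mas.. Hehe",
        "Wkkwkkwkk.....Asem di tanya bilang kepo",
    ]
})
df2 = pd.DataFrame(
    {"text": ["udh smpe kantor", "ga kenal cowonya mas"]},
    index=["A", "B"],
)

tv = TfidfVectorizer(use_idf=True)
# Assumption: fit on both corpora so they share one vocabulary and one set of IDF weights.
tv.fit(pd.concat([df["text"], df2["text"]]))
X1 = tv.transform(df["text"])   # sparse matrix, one row per message in df
X2 = tv.transform(df2["text"])  # sparse matrix, one row per phrase in df2

# With two arguments, cosine_similarity compares every row of X1 with every row of X2.
cross_sim = cosine_similarity(X1, X2)  # dense array of shape (len(df), len(df2))
cross_df = pd.DataFrame(cross_sim, index=df.index, columns=df2.index)
```

Here `cross_df` has one row per message in `df` and one column per phrase in `df2`, instead of the square matrix produced by the code above.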

Nabih Bawazir
  • What is your goal? In other words, the cosine similarity of what with what? Are you trying to get the cosine similarity of the tf-idf vectors for the same ID in 2 different dataframes? – gyoza Jul 03 '18 at 04:28
  • I am expecting a cross cosine similarity. My data actually has 1 million rows. The tutorial I found only produces a 1 million x 1 million matrix, which is impractical. I only need 1 million x 100, so I am starting with 2 dataframes. – Nabih Bawazir Jul 03 '18 at 05:17
  • I found this, which might help, and I am trying it now: https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat – Nabih Bawazir Jul 11 '18 at 13:17
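For the 1 million x 100 case mentioned in the comments, a rough sketch of a batched approach might look like the following. It relies on the fact that `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so the cosine similarity of two rows reduces to their dot product, and the matrices can stay sparse. The function name `cross_cosine_in_batches` and the `batch_size` parameter are hypothetical, and the toy texts are shortened versions of the ones in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def cross_cosine_in_batches(big_texts, small_texts, batch_size=2):
    """Yield (start_row, block): the similarity of one batch of big_texts
    against all of small_texts, as a dense array."""
    tv = TfidfVectorizer()  # default norm="l2", so every non-empty row has unit length
    # Assumption: one shared vocabulary fitted on both corpora.
    tv.fit(list(big_texts) + list(small_texts))
    small = tv.transform(small_texts)  # sparse, shape (n_small, n_terms)
    for start in range(0, len(big_texts), batch_size):
        batch = tv.transform(big_texts[start:start + batch_size])
        # Sparse dot product of unit-length rows equals their cosine similarity.
        yield start, (batch @ small.T).toarray()


big = [
    "Daei rumah Indri jam berpa",
    "Mas dmn .. Ak udh smpe kantor yah.. Mas udh smpe blm??",
    "Biarin .. Km ga kenal cowonya mas.. Hehe",
    "Wkkwkkwkk.....Asem di tanya bilang kepo",
]
small = ["udh smpe kantor", "ga kenal cowonya mas"]

# Stacked here only for the toy data; each block could be written to disk instead.
result = np.vstack([block for _, block in cross_cosine_in_batches(big, small)])
```

Only one batch of the large side is densified at a time, so the full 1 million x 1 million matrix is never built. A realistic `batch_size` would be much larger than 2.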
