I have this sample data. This is a CSV file. I want to create feature vectors of 'Questions' and 'Replies' columns using Bag-of-Word method (CounterVector()) and then calculate the cosine similarity between the question and their replies.
So far I have this python code:
topFeaturesValueListColumns = ['cosinSimilarityIpostRpost', 'Class']
topFeaturesValueList = []
featureVectorsPD = pd.DataFrame()
df = pd.read_csv("test1.csv", usecols = ['ThreadID', 'Title', 'UserID_inipst', 'Questions', 'UserID', 'Replies', 'Class'])
df = pd.DataFrame(df)
df = df.apply(lambda x: x.astype(str).str.lower())
for column in df:
df[column] = df[column].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
cv = CountVectorizer()
features =cv.fit(df['Title']+' '+df['UserID_inipst']+' '+df['Questions']+' '+df['UserID']+' '+df['Replies'])
print(features.vocabulary_)
featureVectorsPD['Questions'] = cv.transform(df['Questions']).toarray().tolist()
featureVectorsPD['Replies'] = cv.transform(df['Replies']).toarray().tolist()
featureVectorsPD['Class'] = df['Class']
for i in range(len(featureVectorsPD)):
q=np.array([featureVectorsPD['Questions'][i]])
r=np.array([featureVectorsPD['Replies'][i]])
label = featureVectorsPD['Class'][i]
res = cosine_similarity(q, r, dense_output=True)
res = float(np.asscalar(res[0]))
row = [res, label]
topFeaturesValueList.append(row)
topQDFValuesPD = pd.DataFrame(topFeaturesValueList, columns=topFeaturesValueListColumns)
print(topQDFValuesPD)
Problem in this code is that the
features = cv.fit(df['Questions'] + ' ' + df['Replies'])
creates words dictionary (features.vocabulary_) from the whole "Questions" and "Replies" columns but my requirement is to calculate "vocabulary" for each thread individually and then create features vectors based on that individual dictionary. in other words in "ThreadID" column when values changes new vocabulary should be created.
I think "groupby" function is used here but how? Hope the question is clear. Please help me. I will be very thankful to you.