I have this sample data in a CSV file. I want to create feature vectors from the 'Questions' and 'Replies' columns using the bag-of-words method (CountVectorizer()) and then calculate the cosine similarity between each question and its replies.

So far I have this Python code:

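import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: the original snippet never defines `stop`; any collection of
# stop words works here, scikit-learn's built-in English list is used as a stand-in.
stop = ENGLISH_STOP_WORDS

topFeaturesValueListColumns = ['cosinSimilarityIpostRpost', 'Class']
topFeaturesValueList = []

featureVectorsPD = pd.DataFrame()
df = pd.read_csv("test1.csv", usecols=['ThreadID', 'Title', 'UserID_inipst', 'Questions', 'UserID', 'Replies', 'Class'])

# Convert every column to lower-case strings
df = df.apply(lambda x: x.astype(str).str.lower())

# Remove stop words from every column
for column in df:
    df[column] = df[column].apply(lambda x: " ".join(w for w in x.split() if w not in stop))

cv = CountVectorizer()

# Fit ONE vocabulary on the text of all threads (this is the part I want to change)
features = cv.fit(df['Title'] + ' ' + df['UserID_inipst'] + ' ' + df['Questions'] + ' ' + df['UserID'] + ' ' + df['Replies'])

print(features.vocabulary_)

featureVectorsPD['Questions'] = cv.transform(df['Questions']).toarray().tolist()
featureVectorsPD['Replies'] = cv.transform(df['Replies']).toarray().tolist()
featureVectorsPD['Class'] = df['Class']

# Cosine similarity between each question and its reply, row by row
for i in range(len(featureVectorsPD)):
    q = np.array([featureVectorsPD['Questions'][i]])
    r = np.array([featureVectorsPD['Replies'][i]])
    label = featureVectorsPD['Class'][i]
    res = cosine_similarity(q, r, dense_output=True)
    res = float(res[0, 0])  # np.asscalar was removed in NumPy 1.23
    row = [res, label]
    topFeaturesValueList.append(row)

topQDFValuesPD = pd.DataFrame(topFeaturesValueList, columns=topFeaturesValueListColumns)
print(topQDFValuesPD)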

The problem with this code is that

features = cv.fit(df['Questions'] + ' ' + df['Replies'])

creates the word dictionary (features.vocabulary_) from the whole "Questions" and "Replies" columns, but my requirement is to build the vocabulary for each thread individually and then create the feature vectors based on that individual dictionary. In other words, whenever the value in the "ThreadID" column changes, a new vocabulary should be created.

I think the "groupby" function should be used here, but how? I hope the question is clear. Any help would be much appreciated.
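
To make the requirement concrete, here is a minimal sketch of what a per-thread vocabulary with groupby might look like. It is only an illustration under assumptions: the toy DataFrame stands in for test1.csv, the helper name thread_similarities is hypothetical, and stop-word removal is left out for brevity.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for test1.csv (hypothetical rows, only the columns used here).
df = pd.DataFrame({
    "ThreadID": [1, 1, 2],
    "Questions": ["how to install python", "how to install python",
                  "best laptop for coding"],
    "Replies": ["install python from the website", "use a package manager",
                "any laptop for coding works"],
    "Class": [1, 0, 1],
})

def thread_similarities(group):
    """Fit a vocabulary on one thread only, then score each question/reply pair."""
    cv = CountVectorizer()
    cv.fit(pd.concat([group["Questions"], group["Replies"]]))
    q = cv.transform(group["Questions"])
    r = cv.transform(group["Replies"])
    # The diagonal of the pairwise matrix holds question i against reply i.
    sims = cosine_similarity(q, r).diagonal()
    return pd.DataFrame(
        {"cosinSimilarityIpostRpost": sims, "Class": group["Class"].to_numpy()},
        index=group.index,
    )

# One vocabulary per ThreadID; the pieces are stitched back in the original row order.
parts = [thread_similarities(group) for _, group in df.groupby("ThreadID")]
result = pd.concat(parts).sort_index()
print(result)
```

Note that CountVectorizer raises a ValueError when a thread's text yields an empty vocabulary (for example, if only stop words remain), so real data may need a guard for that case.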

marc_s
ibrahim
  • Please include the sample data as text in your question, not as a picture, so potential answerers can copy/paste and reproduce your issue – G. Anderson Dec 09 '19 at 15:40
  • Does this answer your question? [Concatenate strings from several rows using Pandas groupby](https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby) – G. Anderson Dec 09 '19 at 15:40
  • I have added link for the sample file. – ibrahim Dec 09 '19 at 16:15
  • Why does it matter that you would have different vocabulary for different ThreadID? If your ThreadID does not have a specific word in the vocabulary, it will just be 0 with a count vector, there is not really an issue there from what I can see. – Thomas Jan 24 '20 at 12:41
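
The point in the last comment can be checked with a small sketch: words that occur in neither text become zero columns, and zero columns add nothing to the dot product or to either norm, so the cosine similarity should not change. The example strings below are made up, and the check assumes default CountVectorizer settings (no min_df, max_df, or max_features filtering).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical texts: one question/reply pair plus an unrelated thread.
question = "how to install python"
reply = "install python from the website"
other_thread = ["best laptop for coding", "any laptop for coding works"]

# Vocabulary fitted on this pair only.
local_cv = CountVectorizer().fit([question, reply])
local_sim = cosine_similarity(
    local_cv.transform([question]), local_cv.transform([reply])
)[0, 0]

# Vocabulary fitted on everything, including the unrelated thread.
global_cv = CountVectorizer().fit([question, reply] + other_thread)
global_sim = cosine_similarity(
    global_cv.transform([question]), global_cv.transform([reply])
)[0, 0]

# The two values should agree up to floating-point rounding.
print(local_sim, global_sim)
```

The vocabulary size differs between the two vectorizers, but the similarity for the pair stays the same as long as both vocabularies contain every word of the pair.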

0 Answers