How to use groupby and CountVectorizer() together in a pandas DataFrame?

Question

I have sample data in a CSV file. I want to create feature vectors for the 'Questions' and 'Replies' columns using the bag-of-words method (CountVectorizer()) and then calculate the cosine similarity between each question and its replies.

So far I have this Python code:

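import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics.pairwise import cosine_similarity

# Assumption: the original post does not show how `stop` was defined;
# scikit-learn's built-in English stop-word list stands in for it here.
stop = set(ENGLISH_STOP_WORDS)

topFeaturesValueListColumns = ['cosinSimilarityIpostRpost', 'Class']
topFeaturesValueList = []

featureVectorsPD = pd.DataFrame()
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_csv("test1.csv", usecols = ['ThreadID', 'Title', 'UserID_inipst', 'Questions', 'UserID', 'Replies', 'Class'])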

df = df.apply(lambda x: x.astype(str).str.lower())

for column in df:
  df[column] = df[column].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

cv = CountVectorizer()

features = cv.fit(df['Title'] + ' ' + df['UserID_inipst'] + ' ' + df['Questions'] + ' ' + df['UserID'] + ' ' + df['Replies'])

print(features.vocabulary_)

featureVectorsPD['Questions'] = cv.transform(df['Questions']).toarray().tolist()
featureVectorsPD['Replies'] = cv.transform(df['Replies']).toarray().tolist()
featureVectorsPD['Class'] = df['Class']

for i in range(len(featureVectorsPD)):
    q=np.array([featureVectorsPD['Questions'][i]])
    r=np.array([featureVectorsPD['Replies'][i]])
    label = featureVectorsPD['Class'][i]
    res = cosine_similarity(q, r, dense_output=True)
    res = float(res[0, 0])  # np.asscalar is deprecated and no longer available in recent NumPy
    row = [res, label]
    topFeaturesValueList.append(row)

topQDFValuesPD = pd.DataFrame(topFeaturesValueList, columns=topFeaturesValueListColumns)
print(topQDFValuesPD)

The problem with this code is that the line

features = cv.fit(df['Questions'] + ' ' + df['Replies'])

builds one vocabulary (features.vocabulary_) from the entire "Questions" and "Replies" columns, but I need a separate vocabulary for each thread, with the feature vectors built from that thread's own vocabulary. In other words, whenever the value in the "ThreadID" column changes, a new vocabulary should be created.

I think the "groupby" function should be used here, but how? I hope the question is clear. Any help would be much appreciated.
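
One way the per-thread requirement could be sketched is to iterate over df.groupby("ThreadID") and fit a fresh CountVectorizer inside each group. This is only a minimal sketch, not a confirmed solution: the small DataFrame below is a hypothetical stand-in for test1.csv, and the helper name thread_similarity is invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for test1.csv, reduced to the columns used here.
df = pd.DataFrame({
    "ThreadID": [1, 1, 2, 2],
    "Questions": ["how to install python", "how to install python",
                  "pandas groupby example", "pandas groupby example"],
    "Replies": ["install python from the website", "use a package manager",
                "groupby splits the frame", "see the pandas docs"],
    "Class": ["good", "bad", "good", "bad"],
})


def thread_similarity(group):
    """Fit a vocabulary on one thread only, then score each question/reply row."""
    cv = CountVectorizer()
    # The vocabulary is built from this thread's questions and replies alone.
    cv.fit(pd.concat([group["Questions"], group["Replies"]]))
    q = cv.transform(group["Questions"])
    r = cv.transform(group["Replies"])
    # cosine_similarity returns all question/reply pairs; the diagonal holds
    # the similarity of each question with the reply on the same row.
    sims = cosine_similarity(q, r).diagonal()
    return pd.Series(sims, index=group.index)


# One vocabulary per ThreadID; the results are stitched back together by index.
parts = [thread_similarity(group) for _, group in df.groupby("ThreadID")]
df["cosinSimilarityIpostRpost"] = pd.concat(parts)

result = df[["ThreadID", "cosinSimilarityIpostRpost", "Class"]]
print(result)
```

A plain loop over the groups is used instead of groupby(...).apply(...) so that the shape of the result does not depend on how a particular pandas version handles apply. A thread whose text contains no usable tokens would make CountVectorizer.fit raise an error, so real data may need a guard for that case.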

Please include the sample data as text in your question, not as a picture, so potential answerers can copy/paste and reproduce your issue — G. Anderson, Dec 09 '19 at 15:40
Does this answer your question? [Concatenate strings from several rows using Pandas groupby](https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby) — G. Anderson, Dec 09 '19 at 15:40
Why does it matter that you would have different vocabulary for different ThreadID? If your ThreadID does not have a specific word in the vocabulary, it will just be 0 in the count vector, so there is not really an issue there from what I can see. — Thomas, Jan 24 '20 at 12:41
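
The point in the last comment can be checked with a small sketch: with raw counts, words that occur in neither the question nor the reply only add zero columns, which change neither the dot product nor the vector norms. The sentences below are hypothetical examples, not taken from the asker's data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = ["how to install python"]
reply = ["install python from the website"]
# Hypothetical text from an unrelated thread, used only to enlarge the vocabulary.
other_thread = ["pandas groupby example", "see the pandas docs"]

# Vocabulary built from this thread only.
local_cv = CountVectorizer().fit(question + reply)
local_sim = cosine_similarity(local_cv.transform(question),
                              local_cv.transform(reply))[0, 0]

# Vocabulary built from every thread.
global_cv = CountVectorizer().fit(question + reply + other_thread)
global_sim = cosine_similarity(global_cv.transform(question),
                               global_cv.transform(reply))[0, 0]

print(len(local_cv.vocabulary_), len(global_cv.vocabulary_))
print(np.isclose(local_sim, global_sim))
```

This holds for plain count vectors. With a corpus-dependent weighting such as TF-IDF, the weights themselves depend on which documents were used for fitting, so a per-thread fit could give different similarities there.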
