
I have a dataframe like

    title                                               titlenew
0   Two Workers Are Struck By Motor Vehicle And O...    two workers are struck by motor vehicle and o...
1   Foreman Is Fatally Crushed When Forklift Tips...    foreman is fatally crushed when forklift tips...
2   Employee Suffers Abdominal Fracture In Fall F...    employee suffers abdominal fracture in fall f...
3   Employee'S Body Is Caught In Asphalt Machine ...    employee's body is caught in asphalt machine ...
4   Employee Is Punctured In Abdomen With Nail          employee is punctured in abdomen with nail

that I converted to vectors for NLP processing. They now look like

    card2vec_title                                      card2vec_titlenew
0   [0.09446411579847336, 0.18325935304164886, 0.1...   [0.01013200543820858, -0.015507892705500126, 0...
1   [0.11135150492191315, 0.16989260911941528, 0.1...   [0.0871051624417305, 0.07891112565994263, -0.0...
2   [-0.019224125891923904, 0.3285079598426819, -0...   [0.052899472415447235, 0.2530696988105774, -0....
3   [0.06179530546069145, 0.10462947934865952, 0.0...   [0.05848287418484688, 0.062050893902778625, -0...
4   [0.0604548417031765, 0.2742682993412018, -0.00...   [0.09018705040216446, 0.23053207993507385, -0.

My question is: how can I find the correlation score (or cosine similarity) between these two columns? Doing df.card2vec_titlenew.corr(df.card2vec_title) raises an error saying

unsupported operand type(s) for /: 'list' and 'int'

My question is about the correlation itself, so I have not included the code that converts the strings to vectors. Any help is appreciated. Thanks.

DarkFantasy
  • is there any help here? – DarkFantasy Jan 11 '21 at 11:37
  • Show your code. You should be able to run corr on the dataframe and send the result to a heatmap: sns.heatmap(weights_df.corr(), center=0, cmap=cmap, linewidths=1, annot=True, fmt=".2f"). To hide the upper triangle, build a mask first: corr = weights_df.corr(); mask = np.triu(np.ones_like(corr, dtype=bool)); sns.heatmap(corr, cmap=cmap, center=0, linewidths=1, annot=True, fmt=".2f", mask=mask) – Golden Lion Jan 27 '21 at 17:29

1 Answer


Correlation is computed between two series of single numeric values, hence the error about 'list' and 'int': corr expects one number in each cell, not a list of values.

Things you could do:

  • Take a dot product of the values in each row, so that each cell holds a single number, and then find the correlation between the two columns.

  • Another way would be to split each column into one new column per vector value and then find the correlation between the new columns. However, this could be tricky if the vectors in your columns have different lengths.
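
Both ideas can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual pipeline: the small 3-dimensional vectors are made-up stand-ins for the real embeddings, and it assumes every vector in both columns has the same length. The first part computes a row-wise cosine similarity (a normalised dot product); the second splits each list column into one numeric column per dimension and correlates matching dimensions with `corrwith`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: short 3-dimensional vectors instead of real embeddings.
df = pd.DataFrame({
    "card2vec_title": [[1.0, 0.0, 0.0], [1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    "card2vec_titlenew": [[1.0, 0.0, 0.0], [2.0, 4.0, 6.0], [1.0, 0.0, 0.0]],
})

# Stack each list column into a 2-D array of shape (n_rows, n_dims).
# Assumption: all vectors have the same length, otherwise this step fails.
a = np.array(df["card2vec_title"].tolist(), dtype=float)
b = np.array(df["card2vec_titlenew"].tolist(), dtype=float)

# Idea 1: row-wise cosine similarity, i.e. the dot product of the two vectors
# in a row divided by the product of their norms.
dots = np.einsum("ij,ij->i", a, b)
norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
df["cosine_sim"] = dots / norms

# Idea 2: split each vector into one numeric column per dimension, then
# compute the Pearson correlation between matching dimensions.
title_cols = pd.DataFrame(a, index=df.index)
titlenew_cols = pd.DataFrame(b, index=df.index)
per_dim_corr = title_cols.corrwith(titlenew_cols)

print(df["cosine_sim"])
print(per_dim_corr)
```

The cosine similarity gives one score per row (how similar the two versions of a title are), while `per_dim_corr` gives one score per vector dimension across all rows, so which one is useful depends on what "correlation of the two columns" is meant to measure.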