I have a 2D array of vectorised rows, where each row represents a document in the corpus:

array[[ 0.0 0.0 0.4583 0.6584 0.0]
                              ...
      [0.4390 0.0 0.0 0.5749 0.0]]

I have calculated cosine similarity for each row/vector in the 2D array with every other vector like so:

#calculate semantic similarity for all permutations all in one go
for i in range(Vectors.shape[0]): #for each vector/row in 2D array
    for j in range(i + 1, Vectors.shape[0]): #for each row + 1 in the 2D array
        cosine_similarities = linear_kernel(Vectors[i], Vectors[j]).flatten()
        #np.savetxt("foo.csv", cosine_similarities, delimiter=",")
        pd.DataFrame(cosine_similarities).to_csv("test_matrix.csv", mode = 'a') #save into csv as a matrix

The output prior to saving into a csv looks like:

[0.5748389]
[0.5847379]
...
[0.3257490]

How can I transform the output into a matrix and save it to a CSV?

The output I'm looking for is:

   0          1           ...  76
0  0.5748389  0.5847379        0.3257490
1  ...        ...         ...   ...
...
76

UPDATE: I found a solution that worked. Using the cosine similarity function directly on the sparse matrix gave me the full result, which I then converted to a list and then to a DataFrame. See "What's the fastest way in Python to calculate cosine similarity given sparse matrix data?" for more info.
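
For readers who land here, a minimal sketch of the approach described in the update, assuming scikit-learn's `cosine_similarity` and a SciPy sparse matrix. The first two rows are copied from the question; the third row and the variable names are made up for illustration, and this is not necessarily the exact code the asker ended up with.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Two rows copied from the question plus one made-up row (hypothetical data).
vectors = csr_matrix(np.array([
    [0.0, 0.0, 0.4583, 0.6584, 0.0],
    [0.4390, 0.0, 0.0, 0.5749, 0.0],
    [0.0, 0.3, 0.0, 0.0, 0.9],
]))

# One call compares every row with every other row.
# The result is a dense (n, n) NumPy array, even for sparse input.
similarity = cosine_similarity(vectors)

# Rows and columns are labelled 0..n-1 by default.
df = pd.DataFrame(similarity)

# With no path given, to_csv returns the CSV as a string;
# pass a filename such as "test_matrix.csv" to write it to disk instead.
csv_text = df.to_csv()
print(similarity.shape)  # (3, 3)
```

Because the whole matrix is built in one call, there is no need to append to the CSV inside a loop.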

Victoria S
  • Why not append all rows to a list, concatenate them with pd.concat, then save to csv all at once? – Michael Delgado May 14 '22 at 20:19
  • I've tried ```for i in cosine_similarities: list.append(i)``` and it only prints out the last row. I suspect the issue is that running cosine similarity on each row returns individual arrays, so the question is how do I concatenate all arrays into a matrix. – Victoria S May 14 '22 at 22:55
  • Yeah no I mean `some_list = []; for i in range(…): for j in range(): cos_similarities = …; some_list.append(cos_similarities)`. Collect the arrays across all loop iterations. – Michael Delgado May 15 '22 at 02:17
  • Unfortunately didn't work, it kept on printing out arrays in a list non-stop. – Victoria S May 15 '22 at 14:59
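
The idea from the comments above, collecting results across all loop iterations, could be sketched roughly like this. It assumes a dense NumPy array and made-up data, and it scales the rows to unit length so that the dot product from `linear_kernel` equals the cosine similarity (TF-IDF vectors from scikit-learn are L2-normalised by default, but that is an assumption about the asker's data). The key point is that the matrix is created once, before the loops, and each pair fills one cell.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical data: three made-up rows, scaled to unit length so that
# the dot product returned by linear_kernel equals the cosine similarity.
raw = np.array([
    [0.0, 1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0, 3.0],
    [2.0, 2.0, 0.0, 1.0],
])
vectors = raw / np.linalg.norm(raw, axis=1, keepdims=True)

n = vectors.shape[0]
matrix = np.zeros((n, n))  # allocated once, before the loops

for i in range(n):
    for j in range(i, n):
        # Slicing with i:i + 1 keeps each row 2D, which linear_kernel expects.
        value = linear_kernel(vectors[i:i + 1], vectors[j:j + 1])[0, 0]
        matrix[i, j] = value
        matrix[j, i] = value  # similarity is symmetric

print(matrix.shape)  # (3, 3)
```

The finished `matrix` can then be wrapped in a DataFrame and written with a single `to_csv` call after the loops, rather than appending inside them.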

1 Answer

If your cosine_similarities.shape is

(77, 77)

then try this

df = pd.DataFrame(cosine_similarities, columns=range(77), index=range(77))
df.to_csv('yourcsv.csv')

If you don't want the index written as a separate column, change the last line to:

df.to_csv('yourcsv.csv', index=False)

Hope this helps!

  • I ran ```cosine_similarities.shape``` and it came out as (1,). What does this mean and how will I then be able to put into matrix dataframe? – Victoria S May 14 '22 at 22:48
  • @VictoriaS This means that cosine_similarities is not a 2D array. I believe your approach to calculating cosine similarity is wrong. I used the linear_kernel function to calculate the cosine similarity. You just have to pass the whole array, like cosine_similarities = linear_kernel(Vectors, Vectors), which will return a 2D matrix. Hope this helps! – Thirunaavukkarasu M May 15 '22 at 01:31
  • Yes, I tried calculating cosine similarity using linear_kernel function too like the way you're doing it, but I am trying to iterate through each row in the vectorised array and calculate cosine similarity between each row with all other rows, hence the loops in my function. Somehow the function I defined gets rid of the 2D array and I'm not sure why and how I can get it back. – Victoria S May 15 '22 at 13:17
  • Actually, I just checked the shape of the array after running the function you mentioned and it returns a 1D array, so I do think Sci-kit linear-kernel function does automatically return a 1D array – Victoria S May 15 '22 at 15:13
  • @VictoriaS linear_kernel does return a 2D array. Remove the flatten() call after the function, since that is what converts the result into a one-dimensional array. Moreover, you don't have to write a function to calculate the cosine value for each row against every other row; that is what the linear_kernel function is for, and it does this for you. If you are still struggling with the problem, edit the question to add 10 sample values from Vectors and describe what you want to do, and I'll try to help you out. – Thirunaavukkarasu M May 15 '22 at 20:16
  • I found a solution to it now and have edited the post with what worked! Thanks so much for your help! – Victoria S May 16 '22 at 21:02
  • Great! It's good that you got a solution. – Thirunaavukkarasu M May 17 '22 at 01:08
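
To illustrate the shape discussion in the comments above, here is a small sketch with made-up data (the array values and variable names are hypothetical). It shows that `linear_kernel` returns a 2D array, and that a shape of `(1,)` appears only after comparing a single row with a single row and then calling `flatten()`.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical data: three made-up document vectors.
vectors = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])

# Whole array against itself: one row and one column per document.
full = linear_kernel(vectors, vectors)

# One row against one row: still 2D, but with a single cell.
pair = linear_kernel(vectors[0:1], vectors[1:2])

print(full.shape)            # (3, 3)
print(pair.shape)            # (1, 1)
print(pair.flatten().shape)  # (1,)
```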