I have a DataFrame with filenames in the first column and a vector of decimal numbers in the second column (which is of the pandas type Series). The DataFrame was loaded from a CSV that looks like this:
,filename,vector
0,my-filename,"[1.2 3.1 2.6 ...]"
1,another-filename,"[1.1 3.3 2.2 ...]"
...
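Since the vectors arrive from the CSV as strings such as "[1.2 3.1 2.6 ...]", they probably need to be parsed into numeric arrays before any distance can be computed. The following is a minimal sketch of one way to do that, assuming every vector has the same length; the inline CSV text (with short, complete vectors), the helper name parse_vector, and the variable matrix are hypothetical and only for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV file, using short complete vectors.
csv_text = ''',filename,vector
0,my-filename,"[1.2 3.1 2.6]"
1,another-filename,"[1.1 3.3 2.2]"
'''

# The unnamed first column of the CSV becomes the index.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)


def parse_vector(text):
    """Turn a string like "[1.2 3.1 2.6]" into a 1-D float array."""
    return np.array(text.strip("[]").split(), dtype=float)


df["vector"] = df["vector"].apply(parse_vector)

# Stack the per-row arrays into one 2-D array (one row per file).
# This assumes all vectors have equal length.
matrix = np.vstack(df["vector"].tolist())
```

Keeping the vectors in a single 2-D NumPy array alongside the filenames is one possible rearrangement of the data that makes whole-array operations easier than working with strings or per-cell arrays.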
I also have the function scipy.spatial.distance.correlation(vec1, vec2) and an input vector. I need to compare that input vector with every vector in the DataFrame using this function and get the filenames of the n most correlated vectors.
Right now I do this by iterating over the DataFrame, calculating the correlations, saving the results, sorting them, and then taking the n most correlated. I have read this answer, which basically says that iterating over a DataFrame is bad (unless you have a very good reason), so I am wondering if there is a better way. I can also adjust the arrangement of the data in the DataFrame if needed.
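For concreteness, the loop-based approach described above might look roughly like the sketch below. It is not the exact code in use: the sample data, the function name top_n_by_loop, and the assumption that the vectors are already parsed into NumPy arrays are all hypothetical. Note that scipy.spatial.distance.correlation returns a correlation distance (1 minus the Pearson correlation), so the most correlated vectors are the ones with the smallest values.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import correlation

# Hypothetical data; the vectors are assumed to be parsed arrays already.
df = pd.DataFrame({
    "filename": ["a", "b", "c"],
    "vector": [
        np.array([1.0, 2.0, 3.0]),
        np.array([3.0, 2.0, 1.0]),
        np.array([1.0, 2.0, 4.0]),
    ],
})


def top_n_by_loop(df, query, n):
    """Return the n filenames whose vectors are most correlated with query."""
    results = []
    # Visit each row and record (distance, filename).
    for filename, vec in zip(df["filename"], df["vector"]):
        results.append((correlation(query, vec), filename))
    # Smaller correlation distance means higher correlation.
    results.sort()
    return [filename for _, filename in results[:n]]


query = np.array([1.0, 2.0, 3.0])
top = top_n_by_loop(df, query, 2)
```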
So, how can one "vectorize" this?