I have a pandas DataFrame with columns paper, author, col_1, col_2, ..., col_100.
Dtypes:
paper: string (unique)
author: string
col_x: float
I understand that what I am trying to do is complex and computationally heavy, but my solution takes ages to finish.
For every row in the DataFrame, I want to self-join with all rows whose author differs from the author in that row. Then I want to apply a function to the col_x values of that row and the col_x values of each joined row, and get some aggregated results.
My solution uses iterrows, which I know is the slowest option, but I cannot think of any other way.
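To make the intended result concrete, here is a minimal plain-Python sketch on made-up toy data. The names `toy_papers`, `cosine` and `pairwise_summary` are hypothetical and exist only for this illustration, and the `paper` and other-author fields in the output are added for readability (my real result only keeps the min/mean/max values and a label).

```python
from math import sqrt
from statistics import mean

# Made-up toy data: (paper, author, feature vector). Real rows have 100 features.
toy_papers = [
    ("p1", "alice", [1.0, 0.0]),
    ("p2", "bob",   [1.0, 0.0]),
    ("p3", "bob",   [0.0, 1.0]),
    ("p4", "carol", [1.0, 1.0]),
]

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pairwise_summary(rows):
    """For each paper and each *other* author with more than one paper,
    return (paper, other_author, min, mean, max) of the cosine similarities."""
    # Group the feature vectors by author once, up front.
    by_author = {}
    for _, author, vec in rows:
        by_author.setdefault(author, []).append(vec)

    out = []
    for paper, author, vec in rows:
        for other, vecs in by_author.items():
            # Skip the paper's own author and authors with a single paper.
            if other == author or len(vecs) < 2:
                continue
            sims = [cosine(vec, w) for w in vecs]
            out.append((paper, other, min(sims), mean(sims), max(sims)))
    return out

summary = pairwise_summary(toy_papers)
for row in summary:
    print(row)
```

In this toy data only bob has more than one paper, so only p1 (alice) and p4 (carol) produce a result row, each compared against bob's two papers.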
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from statistics import mean

papers = ...  # my DataFrame
cols = ['min_col', 'avg_col', 'max_col', 'label']
all_cols = ['col_1', 'col_2', ..., 'col_100']
rows = []  # collect result rows in a list and build the DataFrame once at the end
for ind, paper in papers.iterrows():
    col_vector = paper[all_cols].values.reshape(1, -1)  # bring the columns into the correct format
    temp = papers[papers.author != paper.author].author.unique()  # all authors other than the one in this row
    for auth in temp:
        temp_papers = papers[papers.author == auth]  # all papers of that author
        if temp_papers.shape[0] > 1:  # with more than 1 paper, find the cosine similarity of the row and the joined rows
            res = []
            for t_ind, t_paper in temp_papers.iterrows():
                res.append(cosine_similarity(col_vector, t_paper[all_cols].values.reshape(1, -1))[0][0])
            rows.append([min(res), mean(res), max(res), 0])
df_result = pd.DataFrame(rows, columns=cols)
Version 2:
I also tried cross-joining the DataFrame with itself and then excluding the rows that have the same author. However, when I do that, I get the following warning repeated over several lines.
papers['key'] = 0
cross = papers.merge(papers, on = 'key', how = 'outer')
>> [IPKernelApp] WARNING | No such comm: 3a1ea2fa71f711ea847aacde48001122
Extra info:
The DataFrame has about 45k rows.
There are about 5k unique authors.
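For scale, here is a rough back-of-the-envelope estimate of what the cross join in Version 2 would have to materialize at this size. It assumes the col_x columns are stored as 8-byte float64 values and ignores the string columns and the key, so the real footprint would be larger. The variable names are only for this illustration.

```python
n_rows = 45_000          # rows in the DataFrame
n_authors = 5_000        # approximate number of unique authors
n_float_cols = 100       # col_1 .. col_100
bytes_per_float = 8      # assumption: float64 storage

# A full cross join pairs every row with every row.
cross_rows = n_rows * n_rows

# Each joined row carries the float columns from both sides.
cross_bytes = cross_rows * 2 * n_float_cols * bytes_per_float

# Upper bound on result rows of the loop version: one row per
# (paper, other author) pair, if every other author had more than one paper.
max_result_rows = n_rows * (n_authors - 1)

print(f"cross join rows:       {cross_rows:,}")
print(f"float data alone (TB): {cross_bytes / 1e12:.2f}")
print(f"max result rows:       {max_result_rows:,}")
```

So the cross join would have about two billion rows and terabytes of float data, which is far more than fits in memory on my machine.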