
I have a Python DataFrame with columns paper, author, col_1, col_2, ..., col_100.

Dtypes:

  • paper: string (unique)
  • author: string
  • col_x: float

I understand that what I am trying to do is complex and performance-heavy, but my solution takes ages to finish.

For every row in the DataFrame, I want to self-join with all rows whose author is different from the author of that row. Then I apply a function to the col_x values of the row and the col_x values of each joined row, and collect some aggregated results per author.

My solution uses iterrows, which I know is the slowest option, but I cannot think of any other way.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from statistics import mean

papers = ... # is my dataframe
cols = ['min_col', 'avg_col', 'max_col', 'label']
all_cols = ['col_' + str(i) for i in range(1, 101)]  # col_1, col_2, ..., col_100

df_result = pd.DataFrame({}, columns=cols)

for ind, paper in papers.iterrows():

    col_vector = paper[all_cols].values.reshape(1, -1)  # bring the columns into the right shape for cosine_similarity
    temp = papers[papers.author != paper.author].author.unique()  # get all authors different from the one in this row
    for auth in temp:
        temp_papers = papers[papers.author == auth]  # get all papers of that author
        if temp_papers.shape[0] > 1:  # if that author has more than 1 paper, compute the cosine similarity of this row with each of theirs
            res = []
            for t_ind, t_paper in temp_papers.iterrows():
                res.append(cosine_similarity(col_vector, t_paper[all_cols].values.reshape(1, -1))[0][0])

            df_result = df_result.append(pd.DataFrame([[min(res), mean(res), max(res), 0]], columns=cols), ignore_index=True)

Version 2:

I also tried to do a cross join of the DataFrame with itself and then exclude the rows that have the same author. However, when I run it, I just get the same warning repeated on several lines.

papers['key'] = 0
cross = papers.merge(papers, on = 'key', how = 'outer')
>> [IPKernelApp] WARNING | No such comm: 3a1ea2fa71f711ea847aacde48001122
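
(Rough arithmetic, based on the sizes below, suggests this is the kernel running out of memory rather than a pandas bug: a full cross join of 45k rows with themselves is 45,000 × 45,000 ≈ 2 billion rows, and with ~200 float64 columns after the merge that is roughly 2 × 10⁹ × 200 × 8 bytes ≈ 3 TB before any filtering happens.)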

Extra info

  • DataFrame has a size of 45k rows

  • There are about 5k unique authors

Tasos
  • Have you already checked whether your operation can be done with NumPy functions? Refer to ubuntu's answer here: https://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas – emiljoj Apr 01 '20 at 12:44
  • @emiljoj No, I haven't tried. Sometimes it's easier for me to frame the problem in a more linear way of thinking through a simple iteration. But if there is a way to do it with vectors, then I am happy to try :) – Tasos Apr 01 '20 at 12:46
  • @Tasos could you add some sample input, like 10-20 rows with a few authors and a couple of col_x columns, to be able to test your code and see the output? Also, how many unique authors and overall rows do you have in your dataframe? The answer may depend on these two numbers as well ;) – Ben.T Apr 01 '20 at 14:26
  • 1
    @Ben.T Added the info about the sizes. Will try to get an example in a format that can be posted here – Tasos Apr 01 '20 at 14:28
  • @Tasos so if I understand your code correctly, in df_result you will have, for each row in the original papers, one row per other author. So it will be about 45K * 5K, roughly 225 million rows, is that right? – Ben.T Apr 01 '20 at 14:47
  • More or less correct @Ben.T. It will be a bit less because of the `if temp_papers.shape[0] > 1` check, but the order of magnitude is about that. – Tasos Apr 01 '20 at 14:51
  • @Tasos I see, I guess there are some ways to speed up your code. Here is a dummy input you can include in your question: `np.random.seed(10); papers = pd.DataFrame({'author': list('aabbcdddde'), 'col_1': np.random.randint(30, size=10), 'col_2': np.random.randint(20, size=10), 'col_3': np.random.randint(10, size=10),})` – Ben.T Apr 01 '20 at 15:07
  • @Tasos does the output order matter? – Ben.T Apr 01 '20 at 19:31
  • No. The order doesn't matter at all – Tasos Apr 02 '20 at 15:12

1 Answer


First, if the dataframe is not too big (in your case it seems it is), you can do it by using the vectorization of cosine_similarity. To do so, you first need a mask of all authors having more than one row, then create a dataframe with enough information in the index and columns to be able to groupby and then query the rows you want:

# here are dummy variables
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

np.random.seed(10)
papers = pd.DataFrame({'author': list('aabbcdddae'), 
                       'col_1': np.random.randint(30, size=10), 
                       'col_2': np.random.randint(20, size=10), 
                       'col_3': np.random.randint(10, size=10),})
all_cols = ['col_1', 'col_2', 'col_3']

First solution:

# mask of authors with more than 1 row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)

# use cosine_similarity on all the rows at once,
# compared to all the rows whose author has more than one row
df_f = (pd.DataFrame(cosine_similarity(papers.loc[:,all_cols],papers.loc[mask_author,all_cols]), 
                     # create index and columns to keep some info about authors
                     index=pd.MultiIndex.from_frame(papers['author'].reset_index(), 
                                                    names=['index_ori', 'author_ori']), 
                     columns=papers.loc[mask_author,'author'])
          # put all columns as rows to be able to perform a groupby all index levels and agg
          .stack()
          .groupby(level=[0, 1, 2]).agg([min, 'mean', max])
          # remove rows that compared an author with themselves
          .query('author_ori != author')
          # add label column with 0, not sure why
          .assign(label=0)
          # reset index as you don't seem to care
          .reset_index(drop=True))
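
As a quick check on the dummy data (a sketch, using only names already defined above): df_f should end up with one row per (paper, other author having more than one paper) pair, so the two numbers printed below should match.

# sanity check (sketch): count the expected number of output rows
counts = papers['author'].value_counts()
qualifying = counts[counts.gt(1)].index            # authors with more than one paper
expected = sum((qualifying != a).sum() for a in papers['author'])
print(len(df_f), expected)                         # the two numbers should match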

Now the problem is that with 45K rows and 5K authors, I have some doubt that a normal computer can handle the previous method in memory. The idea is then to do the same operation, but per author group:

# mask for authors with more than a row
mask_author = papers.groupby('author')['author'].transform('count').gt(1)
# instead of doing it for each iteration, save the df with authors with more than a row
papers_gt1 = papers.loc[mask_author, :]

# compared to your method, it is more efficient to save dataframes in a list and concat at the end
# than to use append on a dataframe at each iteration
res = []
# iterate over each author
for auth, dfg in papers[all_cols].groupby(papers['author']):
    # mask to remove the current author from the comparison df
    mask_auth = papers_gt1['author'].ne(auth)
    # append a dataframe built on the same idea as the first solution, with small differences
    # in the operations because dfg and papers_gt1.loc[mask_auth, all_cols]
    # already never share an author
    res.append(pd.DataFrame(cosine_similarity(dfg, papers_gt1.loc[mask_auth, all_cols]), 
                            columns=papers_gt1.loc[mask_auth, 'author'])
                 .stack()
                 .groupby(level=[0, 1]).agg([min, 'mean', max]))
# outside of the loop, concat everything and add the label column
df_f = pd.concat(res, ignore_index=True).assign(label=0)
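
If memory or the per-call overhead of sklearn is still a problem, one more option (just a sketch, not tested at the full 45K x 100 scale) is to use the fact that cosine similarity is a dot product of L2-normalized rows: normalize the matrix once, optionally in float32, and keep the same groupby aggregation as above. All names below come from the dummy example; the only extra assumption is that no row is all zeros.

import numpy as np
import pandas as pd

# sketch: cosine similarity as a dot product of L2-normalized rows,
# normalizing once up front instead of inside every cosine_similarity call
X = papers[all_cols].to_numpy(dtype=np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)              # assumes no all-zero rows

authors = papers['author'].to_numpy()
counts = papers['author'].value_counts()
mask_gt1 = papers['author'].map(counts).gt(1).to_numpy()   # authors with more than one paper

res = []
for auth in np.unique(authors):
    rows = X[authors == auth]                        # papers of the current author
    other = mask_gt1 & (authors != auth)             # rows of the other qualifying authors
    sims = rows @ X[other].T                         # cosine similarities in one matrix product
    # aggregate per compared author, mirroring the pandas groupby above
    agg = (pd.DataFrame(sims, columns=authors[other])
             .stack()
             .groupby(level=[0, 1])
             .agg(['min', 'mean', 'max']))
    res.append(agg)

df_f = pd.concat(res, ignore_index=True).assign(label=0)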

Note: the operation is still long overall, but in your code you were losing efficiency at several levels. Here are a few points to increase the efficiency of your code if you want to keep iterrows (a rough sketch applying them follows the list):

  • using iterrows is not recommended, as you mentioned, but two iterrows plus another loop is really slow
  • the second iterrows was not taking advantage of the fact that cosine_similarity is vectorized for input arrays with several rows
  • doing temp = papers[papers.author != paper.author].author.unique() at each iteration is a big loss of time; the list of unique authors could be created before the loops, and inside the loop you would only check that the current paper.author is not equal to auth (using your notation)
  • in the same spirit, checking if temp_papers.shape[0] > 1 for each auth could be done beforehand; I assume the number of papers does not change, so if you create the list of unique authors outside the loops (previous point), it can already exclude authors with only one paper
  • finally, using append on a dataframe at each auth loop is a big loss of time, see this answer for a timing comparison; the best would be to create another list res_agg to which you would do res_agg.append([min(res), mean(res), max(res), 0]) and, after all the loops, df_result = pd.DataFrame(res_agg, columns=cols)
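
For completeness, here is a rough sketch of your original loop with those points applied (papers, all_cols and cosine_similarity as in your question; still iterrows, so still slow, but without the per-iteration unique(), the inner iterrows and the repeated append; res_agg is just an illustrative name):

# sketch: same logic as the original loop, with the bullet points above applied
counts = papers['author'].value_counts()
authors_gt1 = counts[counts.gt(1)].index                # computed once, outside the loops

res_agg = []
for ind, paper in papers.iterrows():                    # still iterrows, as requested
    col_vector = paper[all_cols].values.reshape(1, -1)
    for auth in authors_gt1:
        if auth == paper.author:                        # replaces the per-row unique() filtering
            continue
        temp_papers = papers.loc[papers.author == auth, all_cols]
        sims = cosine_similarity(col_vector, temp_papers)[0]   # one vectorized call instead of an inner loop
        res_agg.append([sims.min(), sims.mean(), sims.max(), 0])

# build the result once at the end instead of appending to a dataframe at each iteration
df_result = pd.DataFrame(res_agg, columns=['min_col', 'avg_col', 'max_col', 'label'])
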
Ben.T