I have a pandas dataframe containing the details of some papers with approximately 4 million records. I need to find the top 50 authors that have the largest number of publications within this data set. I have two files containing this data so I've to read both of them into dataframes and append them together to get a single dataframe to work with. I've only used the author column from the dataframe as there are 32 other columns that are not required.
So far, I've come up with the following solution. Also, this is an algorithms assignment so I cannot use any inbuilt algorithms. Currently, I'm using a dictionary to store the publication count of every author and then I'm looping over the dict to get the most published authors. Also, there can be multiple authors within a single row like 'Auth 1 | Auth 2 | Auth 3|' which is why I'm splitting the string.
I wanted to know if there is a faster method to accomplish all of this. Is there some way that I can find the maximum during my loop of the dataframe? Again, I'm not permitted to use inbuilt algorithms for searching or sorting. Any suggestions would help.
Thank you.
start_time = ti.default_timer()
only_authors_article = pd.DataFrame(articles['author'])
only_authors_inproceedings = pd.DataFrame(proceedings['author'])
all_authors = only_authors_article.append(only_authors_inproceedings, ignore_index = True)
all_authors = all_authors.dropna(how = 'any')
auth_dict = defaultdict(int)
for auth_list in zip(all_authors['author']):
auth_list = auth_list[0]
if '|' in auth_list:
auths = auth_list.split('|')
for auth in auths:
auth_dict[auth] += 1
else:
auth_dict[auth_list] += 1
most_pub_authors = []
for i in range(0, 50):
max_pub_count = 0
max_pub_auth = None
for author, pub_count in auth_dict.items():
if pub_count > max_pub_count:
max_pub_count = pub_count
max_pub_auth = author
most_pub_authors.append( (max_pub_auth, max_pub_count) )
del auth_dict[max_pub_auth]
print(most_pub_authors)
elapsed_time = ti.default_timer() - start_time
print("Total time taken: " + str(elapsed_time))
Edit 1: Some sample data from all_authors
author
0 Sanjeev Saxena
1 Hans Ulrich Simon
2 Nathan Goodman|Oded Shmueli
3 Norbert Blum
4 Arnold Schönhage
5 Juha Honkala
6 Christian Lengauer|Chua-Huang Huang
7 Alain Finkel|Annie Choquet
8 Joachim Biskup
9 George Rahonis|Symeon Bozapalidis|Zoltán Fülöp...
10 Alex Kondratyev|Maciej Koutny|Victor Khomenko|...
11 Wim H. Hesselink
12 Christian Ronse
13 Carol Critchlow|Prakash Panangaden
14 Fatemeh Ghassemi|Ramtin Khosravi|Rosa Abbasi
15 Robin Milner
16 John Darlington
17 Giuseppe Serazzi|M. Italiani|Maria Calzarossa
18 Vincent Vajnovszki
19 Christian Stahl|Richard Müller 0001|Walter Vogler
20 Luc Devroye
21 K. C. Tan|T. C. Hu
22 William R. Franta
23 Ekkart Kindler
24 Demetres D. Kouvatsos
25 Christian Lengauer|Sergei Gorlatch
26 Roland Meyer
27 Stefan Reisch
28 Erzsébet Csuhaj-Varjú|Victor Mitrana
29 Lila Kari|Manasi S. Kulkarni