## Another approach

First, group by `BookID` and list all authors per book (i.e. all authors per group):
```python
combos = df.groupby('BookID').agg(lambda x: list(x)).reset_index(drop=False)
print(combos)
```

```
   BookID               Author
0       1  [John, Alex, Jenna]
1       2         [John, Alex]
2       3               [John]
3       4    [Alex, Mary, Max]
```
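(Passing the built-in `list` directly, i.e. `df.groupby('BookID').agg(list)`, should give the same result.)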
Next, merge back with the main data on `BookID`, to get the full author list alongside each individual author:
```python
merged = combos.merge(df, how='inner', on='BookID')
print(merged)
```

```
   BookID             Author_x Author_y
0       1  [John, Alex, Jenna]     John
1       1  [John, Alex, Jenna]     Alex
2       1  [John, Alex, Jenna]    Jenna
3       2         [John, Alex]     John
4       2         [John, Alex]     Alex
5       3               [John]     John
6       4    [Alex, Mary, Max]     Alex
7       4    [Alex, Mary, Max]     Mary
8       4    [Alex, Mary, Max]      Max
```
`Author_x` is the full author list and includes `Author_y`. It is now possible to compare the full author list (`Author_x`) with each individual/unique author (`Author_y`), with the following approach:
- Create a dict whose keys are the unique `Author_y` values (i.e. the unique authors) and whose values are empty lists
- Iterate over each key-value pair in the dict:
  - slice the `merged` DataFrame from the step above using the `Author_y` column; this gives all rows for the author in the dict key
  - from the slice, get the list of all authors (`Author_x`) as a flattened list
  - extend the empty list with the difference between the flattened list (all authors) and the dict key
```python
# map each unique author to an (initially empty) list of co-authors
d = {auth: [] for auth in df['Author'].unique()}

for k, v in d.items():
    # full author lists for every book that k appears in
    all_auths = merged[merged['Author_y'] == k]['Author_x'].values.tolist()
    # flatten the nested lists
    auths = [coauths for nested in all_auths for coauths in nested]
    # keep the unique co-authors, excluding the author themselves
    v.extend(list(set(auths) - set([k])))
```
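At this point `d` maps each author to their unique co-authors (the order inside each list is arbitrary, since it comes from a set difference), roughly:

```python
print(d)
# {'John': ['Alex', 'Jenna'],
#  'Alex': ['Max', 'John', 'Mary', 'Jenna'],
#  'Jenna': ['John', 'Alex'],
#  'Mary': ['Max', 'Alex'],
#  'Max': ['Alex', 'Mary']}
```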
Finally, put the dict into a DataFrame and count the non-null values per row:

```python
# one column per possible co-author (width of the longest co-author list)
cnames = ['coauth' + str(k) for k in range(1, max(len(v) for v in d.values()) + 1)]
df_summary = pd.DataFrame.from_dict(d, orient='index', columns=cnames)
# number of non-null co-author entries per row
df_summary['Num_Unique_CoAuthors'] = df_summary.shape[1] - df_summary.isna().sum(axis=1)
print(df_summary)
```
```
  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John    Alex   Jenna    None    None                     2
1   Alex     Max    John    Mary   Jenna                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Alex    Mary    None    None                     2
```
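(Counting the non-null values per row could also be written as `df_summary[cnames].notna().sum(axis=1)`.)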
## Extended data case

If the main data contains a book with a single author (i.e. without any co-authors), then this method prints zero for that row.
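A dummy single-author row can be appended to the data, for example like this (a sketch; the original does not show how the row was added):

```python
# hypothetical way to append the dummy row (BookID 5, single author Tom)
df = pd.concat([df, pd.DataFrame({'BookID': [5], 'Author': ['Tom']})],
               ignore_index=True)
```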
Here is the data with the dummy single-author row added:

```python
print(df)
```
```
   BookID Author
0       1   John
1       1   Alex
2       1  Jenna
3       2   John
4       2   Alex
5       3   John
6       4   Alex
7       4   Mary
8       4    Max
9       5    Tom
```
And here is the output:

```
  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John   Jenna    Alex    None    None                     2
1   Alex    Mary    John   Jenna     Max                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Mary    Alex    None    None                     2
5    Tom    None    None    None    None                     0
```
## Initial Answer

Did you try a `groupby` with `sum` aggregation?

```python
df.groupby(['Author'])['BookID'].sum()
```
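For reference, with the original sample data (without the dummy row) this sums the book IDs per author, which looks roughly like:

```
Author
Alex     7
Jenna    1
John     6
Mary     4
Max      4
Name: BookID, dtype: int64
```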