I have a really large DataFrame with Book IDs and the names of the people who co-authored each book. I want to calculate how many unique co-authors every author has ever worked with over their entire career.

For example:

BookID   Author
  1         John
  1         Alex
  1         Jenna
  2         John
  2         Alex
  3         John
  4         Alex
  4         Mary
  4         Max

The expected result would be as follows (what's in parentheses isn't required, but I put it there to justify the counts; I just want the number):

Author     Num_Unique_CoAuthors
 John           2                    (Alex and Jenna)
 Alex           4                    (John, Jenna, Mary and Max)
 Jenna          2                    (Alex and John)
 Mary           2                    (Alex and Max)
 Max            2                    (Alex and Mary)
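
For reference, the sample data can be reconstructed like this (a sketch; the post doesn't show the construction):

import pandas as pd

df = pd.DataFrame({
    'BookID': [1, 1, 1, 2, 2, 3, 4, 4, 4],
    'Author': ['John', 'Alex', 'Jenna', 'John', 'Alex',
               'John', 'Alex', 'Mary', 'Max'],
})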
3 Answers

First create a set of authors per BookID group in a new column, then subtract each row's own Author from that set, remove empty sets with boolean indexing, flatten the remaining sets per author into one set of unique co-authors, and finally take its length:

# set of all authors per book, joined back to each row
df = df.join(df.groupby('BookID')['Author'].apply(set).rename('new'), 'BookID')

# remove each row's own author from their book's set
df['new'] = [b - {a} for a, b in zip(df['Author'], df['new'])]

# drop empty sets, then flatten each author's sets into one tuple of unique co-authors
df = (df[df['new'].astype(bool)].groupby('Author')['new']
          .apply(lambda x: tuple(set(z for y in x for z in y)))
          .to_frame())

df.insert(0, 'Num_Unique_CoAuthors', df['new'].str.len())
print(df)
        Num_Unique_CoAuthors                       new
Author                                                
Alex                       4  (Max, John, Jenna, Mary)
Jenna                      2              (John, Alex)
John                       2             (Jenna, Alex)
Mary                       2               (Max, Alex)
Max                        2              (Mary, Alex)
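
A more compact sketch of the same idea (my own variation, not part of the original answer): self-merge on BookID, drop self-pairs, then count distinct partners per author. It assumes df is still the original two-column frame, and note that authors with no co-authors drop out of the result entirely.

pairs = df.merge(df, on='BookID')                      # all author pairs per book
pairs = pairs[pairs['Author_x'] != pairs['Author_y']]  # drop self-pairs
print(pairs.groupby('Author_x')['Author_y'].nunique())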

Another approach

First, group by BookID and list all authors per book (i.e. all authors per group):

combos = df.groupby('BookID').agg(list).reset_index()
print(combos)
   BookID               Author
0       1  [John, Alex, Jenna]
1       2         [John, Alex]
2       3               [John]
3       4    [Alex, Mary, Max]

Next, merge back with the main data on BookID, to get the full author list alongside each individual author:

merged = combos.merge(df, how='inner', on='BookID')
print(merged)
   BookID             Author_x Author_y
0       1  [John, Alex, Jenna]     John
1       1  [John, Alex, Jenna]     Alex
2       1  [John, Alex, Jenna]    Jenna
3       2         [John, Alex]     John
4       2         [John, Alex]     Alex
5       3               [John]     John
6       4    [Alex, Mary, Max]     Alex
7       4    [Alex, Mary, Max]     Mary
8       4    [Alex, Mary, Max]      Max

Author_x is the full author list for the book and includes Author_y. It is now possible to compare the full author list (Author_x) with each individual author (Author_y), using the following approach:

  1. Create a dict whose keys are the unique Author_y values (i.e. unique authors) and whose values are empty lists
  2. Iterate over each key-value pair in the dict
  3. Slice the merged DataFrame using the Author_y column; this gives all rows for the author in the dict key
  4. From the slice, get the per-book author lists (Author_x) as one flattened list
  5. Extend the empty list with the difference between the flattened list (all authors) and the dict key
d = {auth: [] for auth in df['Author'].unique()}
for k, v in d.items():
    # all per-book author lists for books in which author k appears
    all_auths = merged[merged['Author_y'] == k]['Author_x'].values.tolist()
    # flatten the nested lists
    auths = [coauth for nested in all_auths for coauth in nested]
    # unique co-authors, excluding the author themselves
    v.extend(set(auths) - {k})

Finally, put the dict into a DataFrame and count the non-null values per row:

cnames = ['coauth' + str(k) for k in range(1, len(d))]
df_summary = pd.DataFrame.from_dict(d, orient='index', columns=cnames)
# count the non-null co-author columns per row
df_summary['Num_Unique_CoAuthors'] = len(cnames) - df_summary[cnames].isna().sum(axis=1)
# move the author index into a column to match the output below
df_summary = df_summary.rename_axis('author').reset_index()
print(df_summary)
  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John    Alex   Jenna    None    None                     2
1   Alex     Max    John    Mary   Jenna                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Alex    Mary    None    None                     2

Extended data case

If the main data contains an author without any co-authors (i.e. a book with a single author), then this method prints zero for that row.

Here is a dummy row added to the data, for an author with no co-authors:
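
One way to append such a row (a sketch; the original doesn't show this step):

df = pd.concat([df, pd.DataFrame({'BookID': [5], 'Author': ['Tom']})],
               ignore_index=True)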

print(df)
   BookID Author
0       1   John
1       1   Alex
2       1  Jenna
3       2   John
4       2   Alex
5       3   John
6       4   Alex
7       4   Mary
8       4    Max
9       5    Tom

And here is the output

  author coauth1 coauth2 coauth3 coauth4  Num_Unique_CoAuthors
0   John   Jenna    Alex    None    None                     2
1   Alex    Mary    John   Jenna     Max                     4
2  Jenna    John    Alex    None    None                     2
3   Mary     Max    Alex    None    None                     2
4    Max    Mary    Alex    None    None                     2
5    Tom    None    None    None    None                     0

Initial Answer

Did you try a groupby with sum aggregation?

df.groupby(['Author'])['BookID'].sum()
    Wouldn't this just give us how many books he's written? Not how many co-authors he has? – BKS Feb 12 '19 at 10:03
  • Also the numbers under Book ID are just IDs, they might as well be strings. Shouldn't be treated as integers, so using sum wouldn't make sense. :) – BKS Feb 12 '19 at 10:07
  • @BKS, thanks for your comments here. I have updated this answer. – edesz Feb 12 '19 at 13:59

I have an alternate solution.

  1. Join on BookID
  2. Create an adjacency matrix using crosstab
  3. Tally up the counts along the rows, not including the author of the row.

>>> df_merge = df.merge(df, on='BookID')
>>> ctdf = pd.crosstab(df_merge.Author_x, df_merge.Author_y, aggfunc='max', values=[1] * len(df_merge)).fillna(0)
>>> ctdf
Author_y  Alex  Jenna  John  Mary  Max
Author_x
Alex       1.0    1.0   1.0   1.0  1.0
Jenna      1.0    1.0   1.0   0.0  0.0
John       1.0    1.0   1.0   0.0  0.0
Mary       1.0    0.0   0.0   1.0  1.0
Max        1.0    0.0   0.0   1.0  1.0
>>> ctdf.apply(lambda x: sum([*x]) - 1)
Author_y
Alex     4.0
Jenna    2.0
John     2.0
Mary     2.0
Max      2.0
dtype: float64
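
Since every author co-occurs with themselves exactly once, the diagonal is all ones, so the same tally can be written more directly (my simplification, not part of the original answer):

>>> (ctdf.sum(axis=1) - 1).astype(int)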