2

Question: Count how many people have published paper(s) on more than one topic

Example:

  1. Poo has two papers, both associated with Physics only; since only one topic is involved, Poo is not counted
  2. Amy has one paper with two (2) topics (Physics, Economics), so she should be counted
  3. Baa has two papers, each with a distinct topic, so Baa should be counted

Example Dataframes:

|   | id  | name | has_published_papers                    |
|---|-----|------|-----------------------------------------|
| 0 | 100 | Amy  | pp3524172                               |
| 1 | 101 | Bla  | pp0120888,pp0343660,pp0151738,pp0120631 |
| 2 | 102 | Foo  | pp0134084,pp1262416,pp0120082,pp0117571 |
| 3 | 103 | Boo  | pp0274558,pp0108872,pp1796960,pp0117509 |
| 4 | 104 | Soo  | pp0120338,pp0993846,pp1375666,pp0407887 |
| 5 | 111 | Poo  | pp0152095,pp1234567                     |
| 6 | 112 | Baa  | pp0237474,pp0152095                     |

|    | paper_id  | name        | topics             |
|----|-----------|-------------|--------------------|
| 30 | pp3524172 | research A  | Physics, Economics |
| 40 | pp0189076 | research B  | Math, Physics      |
| 55 | pp0237474 | research C  | Education          |
| 68 | pp2729488 | research D  | Physics, Math      |
| 79 | pp0152095 | research Z  | Physics            |
| 99 | pp1234567 | research X  | Physics            |
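
A reproducible construction of these two DataFrames (here assuming the names df and df2; the values are copied from the tables above) could be:

import pandas as pd

df = pd.DataFrame({
    'id': [100, 101, 102, 103, 104, 111, 112],
    'name': ['Amy', 'Bla', 'Foo', 'Boo', 'Soo', 'Poo', 'Baa'],
    'has_published_papers': [
        'pp3524172',
        'pp0120888,pp0343660,pp0151738,pp0120631',
        'pp0134084,pp1262416,pp0120082,pp0117571',
        'pp0274558,pp0108872,pp1796960,pp0117509',
        'pp0120338,pp0993846,pp1375666,pp0407887',
        'pp0152095,pp1234567',
        'pp0237474,pp0152095'
    ]
})

df2 = pd.DataFrame({
    'paper_id': ['pp3524172', 'pp0189076', 'pp0237474',
                 'pp2729488', 'pp0152095', 'pp1234567'],
    'name': ['research A', 'research B', 'research C',
             'research D', 'research Z', 'research X'],
    'topics': ['Physics, Economics', 'Math, Physics', 'Education',
               'Physics, Math', 'Physics', 'Physics']
})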

Edit: The desired output is an integer representing the number of people who have published paper(s) on more than one topic.

Augusto
  • What have you tried so far ? What is the desired output ? Please, have a look at [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Alexandre B. Jul 24 '20 at 13:06

2 Answers

5

You need a few steps to merge those datasets.

The first thing to do is split and expand the has_published_papers column into multiple columns:

import pandas as pd

authors_df = pd.DataFrame({
    'id': [100, 101],
    'name': ['Amy', 'Bla'],
    'has_published_papers': ['pp3524172', 'pp0120888,pp0343660,pp0151738,pp0120631']
})
authors_df.has_published_papers.str.split(',', expand=True)

This would output

           0          1          2          3
0  pp3524172       None       None       None
1  pp0120888  pp0343660  pp0151738  pp0120631

Then you can concatenate this to your original dataframe and melt it:

authors_papers_df = (
    pd.concat([
        authors_df.drop(columns=['has_published_papers']),
        authors_df.has_published_papers.str.split(',', expand=True)
    ], axis=1)
    .melt(['id', 'name'], value_name='paper_id')
    .dropna(subset=['paper_id']))

This outputs a dataframe ready for merging:

    id name variable   paper_id
0  100  Amy        0  pp3524172
1  101  Bla        0  pp0120888
3  101  Bla        1  pp0343660
5  101  Bla        2  pp0151738
7  101  Bla        3  pp0120631

You can do the exact same split/expand/melt pipeline to create a papers_topics dataframe.
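
Here papers_df is assumed to hold the question's second table; a minimal construction (values copied from that table) could be:

papers_df = pd.DataFrame({
    'paper_id': ['pp3524172', 'pp0189076', 'pp0237474',
                 'pp2729488', 'pp0152095', 'pp1234567'],
    'name': ['research A', 'research B', 'research C',
             'research D', 'research Z', 'research X'],
    'topics': ['Physics, Economics', 'Math, Physics', 'Education',
               'Physics, Math', 'Physics', 'Physics']
})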

papers_topics_df = (
    pd.concat([
        papers_df.drop(columns=['topics']),
        papers_df.topics.str.split(', ', expand=True)
    ], axis=1)
    .melt(['paper_id', 'name'], value_name='topic')
    .dropna(subset=['topic'])
)

Then you can merge them on paper_id.

authors_papers_topics_df = authors_papers_df.merge(papers_topics_df, on='paper_id')

Now you have a dataframe relating topics, papers and authors.

To count unique topics for each author, you can use:

authors_papers_topics_df.groupby('id')['topic'].nunique()

Saw your edit. To get the number of authors with more than one topic, use:

import numpy as np

np.sum(authors_papers_topics_df.groupby('id')['topic'].nunique() > 1)
Daniel R
2

First convert both DataFrames:

  1. Your first DataFrame (df) to a DataFrame with a separate row for each paper and the paper id as the index:

     paper = df.set_index('name').has_published_papers.str.split(',')\
         .explode().reset_index(name='id').set_index('id')
    

    The result is:

               name
     id            
     pp3524172  Amy
     pp0120888  Bla
     pp0343660  Bla
     pp0151738  Bla
     pp0120631  Bla
     pp0134084  Foo
     pp1262416  Foo
     pp0120082  Foo
     pp0117571  Foo
     pp0274558  Boo
     pp0108872  Boo
     pp1796960  Boo
     pp0117509  Boo
     pp0120338  Soo
     pp0993846  Soo
     pp1375666  Soo
     pp0407887  Soo
     pp0152095  Poo
     pp1234567  Poo
     pp0237474  Baa
     pp0152095  Baa
    
  2. Your second DataFrame (df2) to a Series, splitting each topics entry into separate elements, again with the paper id as the index:

     topic = df2.set_index('paper_id').topics.str.split(', ').explode()
    

    The result is:

     paper_id
     pp3524172      Physics
     pp3524172    Economics
     pp0189076         Math
     pp0189076      Physics
     pp0237474    Education
     pp2729488      Physics
     pp2729488         Math
     pp0152095      Physics
     pp1234567      Physics
     Name: topics, dtype: object
    

Then:

  • join both of them,
  • drop duplicates (to eliminate cases where somebody has multiple papers on the same topic),
  • group by name and count topics for each person.

The code to do it is:

result = paper.join(topic).dropna().reset_index(drop=True)\
    .drop_duplicates().groupby('name').count().reset_index()

getting:

  name  topics
0  Amy       2
1  Baa       2
2  Poo       1

And to get the final result, retrieve the rows with topics > 1 and keep only the name column:

result.query('topics > 1').name

The final result is:

0    Amy
1    Baa
Name: name, dtype: object

Or if you want to get a plain Python list (instead of a Series), add .tolist() to the above instruction, as shown below:
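
result.query('topics > 1').name.tolist()

which this time will get: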

['Amy', 'Baa']
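
If you need the integer the question asks for (rather than the names), the length of this result gives it:

len(result.query('topics > 1'))

which is 2 for the example data above.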
Valdi_Bo