2

Question: Count how many people have published paper(s) on more than one topic

Example:

  1. Poo has two papers, both associated with Physics only; since only one topic is involved, Poo is not counted
  2. Amy has one paper with two (2) topics (Physics, Economics), so she should be counted
  3. Baa has two papers, each with a distinct topic, so Baa should be counted

Example Dataframes:

|   | id  | name | has_published_papers                    |
|---|-----|------|-----------------------------------------|
| 0 | 100 | Amy  | pp3524172                               |
| 1 | 101 | Bla  | pp0120888,pp0343660,pp0151738,pp0120631 |
| 2 | 102 | Foo  | pp0134084,pp1262416,pp0120082,pp0117571 |
| 3 | 103 | Boo  | pp0274558,pp0108872,pp1796960,pp0117509 |
| 4 | 104 | Soo  | pp0120338,pp0993846,pp1375666,pp0407887 |
| 5 | 111 | Poo  | pp0152095,pp1234567                     |
| 6 | 112 | Baa  | pp0237474,pp0152095                     |

|    | paper_id  | name        | topics             |
|----|-----------|-------------|--------------------|
| 30 | pp3524172 | research A  | Physics, Economics |
| 40 | pp0189076 | research B  | Math, Physics      |
| 55 | pp0237474 | research C  | Education          |
| 68 | pp2729488 | research D  | Physics, Math      |
| 79 | pp0152095 | research Z  | Physics            |
| 99 | pp1234567 | research X  | Physics            |
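
A reproducible construction of these two DataFrames (here assuming the names df and df2; the values are copied from the tables above) could be:

import pandas as pd

df = pd.DataFrame({
    'id': [100, 101, 102, 103, 104, 111, 112],
    'name': ['Amy', 'Bla', 'Foo', 'Boo', 'Soo', 'Poo', 'Baa'],
    'has_published_papers': [
        'pp3524172',
        'pp0120888,pp0343660,pp0151738,pp0120631',
        'pp0134084,pp1262416,pp0120082,pp0117571',
        'pp0274558,pp0108872,pp1796960,pp0117509',
        'pp0120338,pp0993846,pp1375666,pp0407887',
        'pp0152095,pp1234567',
        'pp0237474,pp0152095'
    ]
})

df2 = pd.DataFrame({
    'paper_id': ['pp3524172', 'pp0189076', 'pp0237474',
                 'pp2729488', 'pp0152095', 'pp1234567'],
    'name': ['research A', 'research B', 'research C',
             'research D', 'research Z', 'research X'],
    'topics': ['Physics, Economics', 'Math, Physics', 'Education',
               'Physics, Math', 'Physics', 'Physics']
})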

Edit: The desired output is an integer representing the number of people who have published paper(s) on more than one topic.

Augusto
  • What have you tried so far ? What is the desired output ? Please, have a look at [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Alexandre B. Jul 24 '20 at 13:06

2 Answers

5

You need a few steps to merge those datasets.

The first thing to do is split and expand the has_published_papers column into multiple columns:

import pandas as pd

authors_df = pd.DataFrame({
    'id': [100, 101],
    'name': ['Amy', 'Bla'],
    'has_published_papers': ['pp3524172', 'pp0120888,pp0343660,pp0151738,pp0120631']
})
authors_df.has_published_papers.str.split(',', expand=True)

This would output

           0          1          2          3
0  pp3524172       None       None       None
1  pp0120888  pp0343660  pp0151738  pp0120631

Then you can concatenate this to your original dataframe and melt it:

authors_papers_df = (
    pd.concat([
        authors_df.drop(columns=['has_published_papers']),
        authors_df.has_published_papers.str.split(',', expand=True)
    ], axis=1)
    .melt(['id', 'name'], value_name='paper_id')
    .dropna(subset=['paper_id']))

This outputs a dataframe ready for merging:

    id name variable   paper_id
0  100  Amy        0  pp3524172
1  101  Bla        0  pp0120888
3  101  Bla        1  pp0343660
5  101  Bla        2  pp0151738
7  101  Bla        3  pp0120631

You can do the exact same split/expand/melt pipeline to create a papers_topics dataframe.
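
Here papers_df is assumed to hold the question's second table; a minimal construction (values copied from that table) could be:

papers_df = pd.DataFrame({
    'paper_id': ['pp3524172', 'pp0189076', 'pp0237474',
                 'pp2729488', 'pp0152095', 'pp1234567'],
    'name': ['research A', 'research B', 'research C',
             'research D', 'research Z', 'research X'],
    'topics': ['Physics, Economics', 'Math, Physics', 'Education',
               'Physics, Math', 'Physics', 'Physics']
})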

papers_topics_df = (
    pd.concat([
        papers_df.drop(columns=['topics']),
        papers_df.topics.str.split(', ', expand=True)
    ], axis=1)
    .melt(['paper_id', 'name'], value_name='topic')
    .dropna(subset=['topic'])
)

Then you can merge them on paper_id.

authors_papers_topics_df = authors_papers_df.merge(papers_topics_df, on='paper_id')

Now you have a dataframe relating topics, papers and authors.

To count unique topics for each author, you can use:

authors_papers_topics_df.groupby('id')['topic'].nunique()

Saw your edit. To get the number of authors with more than one topic, use:

import numpy as np

np.sum(authors_papers_topics_df.groupby('id')['topic'].nunique() > 1)
Daniel R
2

First convert both DataFrames:

  1. Your first DataFrame (df) to a DataFrame with a separate row for each paper and the paper id as the index:

     paper = df.set_index('name').has_published_papers.str.split(',')\
         .explode().reset_index(name='id').set_index('id')
    

    The result is:

               name
     id            
     pp3524172  Amy
     pp0120888  Bla
     pp0343660  Bla
     pp0151738  Bla
     pp0120631  Bla
     pp0134084  Foo
     pp1262416  Foo
     pp0120082  Foo
     pp0117571  Foo
     pp0274558  Boo
     pp0108872  Boo
     pp1796960  Boo
     pp0117509  Boo
     pp0120338  Soo
     pp0993846  Soo
     pp1375666  Soo
     pp0407887  Soo
     pp0152095  Poo
     pp1234567  Poo
     pp0237474  Baa
     pp0152095  Baa
    
  2. Your second DataFrame (df2) to a Series, splitting each topics entry into separate elements, again with the paper id as the index:

     topic = df2.set_index('paper_id').topics.str.split(', ').explode()
    

    The result is:

     paper_id
     pp3524172      Physics
     pp3524172    Economics
     pp0189076         Math
     pp0189076      Physics
     pp0237474    Education
     pp2729488      Physics
     pp2729488         Math
     pp0152095      Physics
     pp1234567      Physics
     Name: topics, dtype: object
    

Then:

  • join both of them,
  • drop duplicates (to eliminate cases where somebody has multiple papers on the same topic),
  • group by name and count topics for each person.

The code to do it is:

result = paper.join(topic).dropna().reset_index(drop=True)\
    .drop_duplicates().groupby('name').count().reset_index()

getting:

  name  topics
0  Amy       2
1  Baa       2
2  Poo       1

And to get the final result, retrieve the rows with topics > 1 and keep only the name column:

result.query('topics > 1').name

The final result is:

0    Amy
1    Baa
Name: name, dtype: object

Or if you want to get a plain Python list (instead of a Series), add .tolist() to the above instruction, as shown below:
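
result.query('topics > 1').name.tolist()

which this time will get: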

['Amy', 'Baa']
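
If you need the integer the question asks for (rather than the names), the length of this result gives it:

len(result.query('topics > 1'))

which is 2 for the example data above.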
Valdi_Bo