7

I have a dataframe that looks like the following:

publication_title    authors                             type ...
title 1              ['author1', 'author2', 'author3']   proceedings
title 2              ['author4', 'author5']              collections
title 3              ['author6', 'author7']              books
.
.
. 

What I want to do is take the column 'authors' and split the list inside it into several rows by duplicating all the other columns, and I want also to store the results in a new column named: 'author' and keep the original column.

The following depicts exactly what I want to achieve:

publication_title    authors                             author          type ...
title 1              ['author1', 'author2', 'author3']   author1         proceedings
title 1              ['author1', 'author2', 'author3']   author2         proceedings
title 1              ['author1', 'author2', 'author3']   author3         proceedings
title 2              ['author4', 'author5']              author4         collections
title 2              ['author4', 'author5']              author5         collections
title 3              ['author6', 'author7']              author6         books
title 3              ['author6', 'author7']              author7         books
.
.
. 

I have tried to achieve this using pandas DataFrame explode method but I cannot find a way to store the results in a new column.

Thanks for the help.

Aniss Chohra
  • 391
  • 4
  • 18

3 Answers3

6

Since pandas 0.25.0 we have the explode method. First we duplicate the authors column and rename it at the same time using assign then we explode this column to rows and duplicate the other columns:

df.assign(author=df['authors']).explode('author')

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
0           title_1  [author1, author2, author3]  proceedings  author2
0           title_1  [author1, author2, author3]  proceedings  author3
1           title_2           [author4, author5]  collections  author4
1           title_2           [author4, author5]  collections  author5
2           title_3           [author6, author7]        books  author6
2           title_3           [author6, author7]        books  author7

If you want remove the duplicated index, use reset_index:

df.assign(author=df['authors']).explode('author').reset_index(drop=True)

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
1           title_1  [author1, author2, author3]  proceedings  author2
2           title_1  [author1, author2, author3]  proceedings  author3
3           title_2           [author4, author5]  collections  author4
4           title_2           [author4, author5]  collections  author5
5           title_3           [author6, author7]        books  author6
6           title_3           [author6, author7]        books  author7
Erfan
  • 40,971
  • 8
  • 66
  • 78
0

You can first make a new DataFrame with the authors:

df2 = pd.DataFrame(df['author'].tolist(), index=df.index).stack()

Next we drop the second level index:

df2.index = df2.index.droplevel(1)

Next we can concatenate on the second axis:

>>> pd.concat([df, df2], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7

or with a one-liner:

>>> pd.concat([df, pd.DataFrame(df['author'].tolist(), index=df.index).stack().reset_index(level=1, drop=True)], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7
Willem Van Onsem
  • 443,496
  • 30
  • 428
  • 555
0

You've discovered explode which means you are almost there! Just merge original data with exploded data, see code below,

# data
df = pd.DataFrame({'publication_title':['title_1','title_2','title_3'],
              'authors':[['author1', 'author2', 'author3'],['author4', 'author5'],['author6', 'author7']],
              'type':['proceedings','collections','books']})
(df.explode(column='authors')
   .rename(columns={'authors':'author'})
   .merge(df))
Mark Wang
  • 2,623
  • 7
  • 15