3

we have the following data

  Name  genres

  A     Action|Adventure|Science Fiction|Thriller 
  B     Action|Adventure|Science Fiction|Thriller 
  C     Adventure|Science Fiction|Thriller

I want the data such that my data frame is

  Name  genres
  A     Action
  A     Adventure
  A     Science Fiction
  A     Thriller
  B     Action
  B     Adventure
  B     Science Fiction
  B     Thriller
  C     Adventure
  C     Science Fiction
  C     Thriller

here is my code

gen = df1[df1['genres'].str.contains('|')]
gen1 = gen.copy()
gen2 = gen.copy()
gen3 = gen.copy()
gen4 = gen.copy()
gen1['genres'] = gen1['genres'].apply(lambda x: x.split("|")[0])
gen2['genres'] = gen2['genres'].apply(lambda x: x.split("|")[1])
gen3['genres'] = gen3['genres'].apply(lambda x: x.split("|")[2])   
gen4['genres'] = gen4['genres'].apply(lambda x: x.split("|")[3])

I am getting the error

IndexError: list index out of range

Aakash Patel
  • 274
  • 2
  • 17
  • 2
    Possible duplicate of [Split (explode) pandas dataframe string entry to separate rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows) – Sai Kumar Sep 30 '18 at 09:04

1 Answers1

5

Crete list of genres by split, repeat values by str.len and last flatten lists by chain.from_iterable:

from itertools import chain

genres = df['genres'].str.split('|')
df = pd.DataFrame({
    'Name' : df['Name'].values.repeat(genres.str.len()),
    'genres' : list(chain.from_iterable(genres.tolist()))
})

print (df)
   Name           genres
0     A           Action
1     A        Adventure
2     A  Science Fiction
3     A         Thriller
4     B           Action
5     B        Adventure
6     B  Science Fiction
7     B         Thriller
8     C        Adventure
9     C  Science Fiction
10    C         Thriller

EDIT:

Solution for dynamic number of columns:

print (df)
  Name                                     genres  col
0    A  Action|Adventure|Science Fiction|Thriller    2
1    B  Action|Adventure|Science Fiction|Thriller    3
2    C         Adventure|Science Fiction|Thriller    5

from itertools import chain

cols = df.columns.difference(['genres'])
genres = df['genres'].str.split('|')

df =  (df.loc[df.index.repeat(genres.str.len()), cols]
         .assign(genres=list(chain.from_iterable(genres.tolist()))))
print (df)
  Name  col           genres
0    A    2           Action
0    A    2        Adventure
0    A    2  Science Fiction
0    A    2         Thriller
1    B    3           Action
1    B    3        Adventure
1    B    3  Science Fiction
1    B    3         Thriller
2    C    5        Adventure
2    C    5  Science Fiction
2    C    5         Thriller
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Now if suppose I have 10 columns and I want to the same operation I would have to enter 10 columns ```df = pd.DataFrame({ 'Name' : df['Name'].values.repeat(genres.str.len()), 'genres' : list(chain.from_iterable(genres.tolist())) })``` Is there some way to avoid writing individual columns in the ```pd.dataframe()``` – Aakash Patel Sep 30 '18 at 07:03
  • Solution is possible changed, give me a sec – jezrael Sep 30 '18 at 07:07
  • 1
    @AakashPatel - Please check edited answer - idea is repeat index values and by `loc` repeat values. – jezrael Sep 30 '18 at 07:11
  • when I am trying to get ```df.genres.unique``` I am getting the whole column again not the unique entries – Aakash Patel Oct 01 '18 at 09:23
  • @AakashPatel - Not sure if understand, if use `print (df.genres.unique)` after `df = (df.loc[df.index.repeat(genres.str.len()), cols] .assign(genres=list(chain.from_iterable(genres.tolist())))) print (df)` ? – jezrael Oct 01 '18 at 09:25
  • I am getting the output as ``` – Aakash Patel Oct 01 '18 at 09:35
  • @AakashPatel - sorry, need `()` like `print (df.genres.unique())` – jezrael Oct 01 '18 at 10:04