2

One of the columns in my pandas dataframe contains a list. I want to expand it so that the frame takes the vertical (long) shape shown below. How can I do this?

Before(code):

import pandas as pd
df = pd.DataFrame({
    'col1':['fruit', 'veicle', 'animal'],
    'col2':['apple', 'bycicle', 'cat'],
    'col3':[1,4,2],
    'list':[
        [10, 20],
        [1.2, 3.0, 2.75],
        ['tommy', 'tom']
    ]
})

Before(table):

    |col1  |col2   |col3|list            |
    |------|-------|----|----------------|
    |fruit |apple  |   1|[10, 20]        |
    |veicle|bycicle|   4|[1.2, 3.0, 2.75]|
    |animal|cat    |   2|['tommy', 'tom']|

After(table):

    |col1  |col2   |col3|list   |
    |------|-------|----|-------|
    |fruit |apple  |   1|10     |
    |fruit |apple  |   1|20     |
    |veicle|bycicle|   4|1.2    |
    |veicle|bycicle|   4|3.0    |
    |veicle|bycicle|   4|2.75   |
    |animal|cat    |   2|'tommy'|
    |animal|cat    |   2|'tom'  |

Note1: The lists differ in length and in element type.

Note2: I can NOT modify the code that generates the dataframe.

Thank you for reading.

cs95
AkiraIsaka
  • Possible duplicate of [Explode lists with different lengths in Pandas](https://stackoverflow.com/questions/45885143/explode-lists-with-different-lengths-in-pandas) – BENY Aug 27 '17 at 17:56
  • Before asking you can simply google it: https://stackoverflow.com/questions/45885143/explode-lists-with-different-lengths-in-pandas/45886206#45886206 – BENY Aug 27 '17 at 17:56
  • Thank you for the useful link, and excuse me for posting a duplicate question. I searched carefully on Google but could not find that article. – AkiraIsaka Aug 29 '17 at 22:48

3 Answers

5

You can set_index on the first three columns, apply pd.Series to the list column to expand each list into its own columns, and then stack the result.

df.set_index(['col1','col2','col3'])['list'].apply(pd.Series).stack().reset_index().drop('level_3',axis=1)

Output:

     col1     col2  col3      0
0  fruit   apple    1     10   
1  fruit   apple    1     20   
2  veicle  bycicle  4     1.2  
3  veicle  bycicle  4     3    
4  veicle  bycicle  4     2.75 
5  animal  cat      2     tommy
6  animal  cat      2     tom  
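
To keep the original column name instead of the default 0, the same chain can end with a rename (a small variant, assuming the df from the question):

df.set_index(['col1','col2','col3'])['list'].apply(pd.Series).stack()\
    .reset_index().drop('level_3',axis=1).rename(columns={0: 'list'})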
Bharath M Shetty
4

Learned this cool trick from piR the other day, using np.repeat and np.concatenate:

import numpy as np

idx = np.arange(len(df)).repeat(df.list.str.len(), 0)
out = df.iloc[idx, :-1].assign(list=np.concatenate(df.list.values))
print(out)

     col1     col2  col3   list
0   fruit    apple     1     10
0   fruit    apple     1     20
1  veicle  bycicle     4    1.2
1  veicle  bycicle     4    3.0
1  veicle  bycicle     4   2.75
2  animal      cat     2  tommy
2  animal      cat     2    tom
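
For intuition, idx simply repeats each row's positional index by the length of its list; with the example data the list lengths are [2, 3, 2], so:

import numpy as np

np.arange(3).repeat([2, 3, 2])
# array([0, 0, 1, 1, 1, 2, 2])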

Performance

Small

# Bharath
%timeit df.set_index(['col1','col2','col3'])['list'].apply(pd.Series).stack()\
              .reset_index().drop('level_3',axis=1)
100 loops, best of 3: 7.75 ms per loop

# Mine
%%timeit 
idx = np.arange(len(df)).repeat(df.list.str.len(), 0)    
out = df.iloc[idx, :-1].assign(list=np.concatenate(df.list.values))    
1000 loops, best of 3: 1.41 ms per loop

Large

df_test = pd.concat([df] * 10000)

# Bharath
%timeit df_test.set_index(['col1','col2','col3'])['list'].apply(pd.Series).stack()\
              .reset_index().drop('level_3',axis=1)
1 loop, best of 3: 7.09 s per loop

# Mine
%%timeit 
idx = np.arange(len(df_test)).repeat(df_test.list.str.len(), 0)    
out = df_test.iloc[idx, :-1].assign(list=np.concatenate(df_test.list.values))
10 loops, best of 3: 123 ms per loop

As a one-liner, Bharath's answer is short, but slow. Here's an improvement that uses the DataFrame constructor instead of df.apply for a roughly 200x speedup on large data:

idx = df.set_index(['col1', 'col2', 'col3']).index
out = pd.DataFrame(df.list.values.tolist(), index=idx).stack()\
                .reset_index().drop('level_3', 1).rename(columns={0 : 'list'})

print(out)

     col1     col2  col3   list
0   fruit    apple     1     10
1   fruit    apple     1     20
2  veicle  bycicle     4    1.2
3  veicle  bycicle     4      3
4  veicle  bycicle     4   2.75
5  animal      cat     2  tommy
6  animal      cat     2    tom
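
For reference, the intermediate pd.DataFrame(df.list.values.tolist(), ...) call pads the shorter lists with NaN on the right, and the subsequent stack() drops those NaNs, which is why lists of different lengths work here:

# shorter inner lists become rows padded with NaN in the trailing columns
pd.DataFrame([[10, 20], [1.2, 3.0, 2.75], ['tommy', 'tom']])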

Small

100 loops, best of 3: 4.7 ms per loop

Large

10 loops, best of 3: 28.9 ms per loop
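
As a side note for readers on newer pandas (0.25 or later), DataFrame.explode handles this transformation directly; a minimal sketch, assuming the df from the question:

# pandas >= 0.25; repeats the other columns once per list element
out = df.explode('list').reset_index(drop=True)
print(out)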
cs95
  • Numpy is of course very fast. It's hard to beat a numpy answer. – Bharath M Shetty Aug 27 '17 at 15:01
  • @Bharathshetty Yes but I didn't expect pandas to be this slow. – cs95 Aug 27 '17 at 15:01
  • I used apply, so yes it's a bit slow. I think apply always costs a bit of performance. – Bharath M Shetty Aug 27 '17 at 15:02
  • @Bharathshetty Edited my answer with an improvement over your solution. – cs95 Aug 27 '17 at 15:15
  • I wanted to add exactly that, but since it was the same stack and reset_index() I didn't, and I didn't know it would improve the performance. Really nice. – Bharath M Shetty Aug 27 '17 at 15:20
  • @Wen Yes. What I don't get is that a suboptimal solution has 5 upvotes. – cs95 Aug 27 '17 at 17:51
  • @cᴏʟᴅsᴘᴇᴇᴅ me too , I will give your answer my upvote and mark the question as duplicate – BENY Aug 27 '17 at 17:55
  • @cᴏʟᴅsᴘᴇᴇᴅ Could you please test this timing ? :)`out = pd.DataFrame(pd.DataFrame(item for item in df['list']).values, index=idx).stack()\ .reset_index().drop('level_3', 1).rename(columns={0 : 'list'})` – BENY Aug 27 '17 at 18:32
  • Thank you for your solution and the detailed performance check, and excuse me for my late response. This is a brilliant method! – AkiraIsaka Aug 29 '17 at 22:58
0

Here is roughly how you can accomplish this task. It is not an exact solution, but it should give you an idea of the approach:

original_df = <your dataframe to start>
new_df = pd.DataFrame()
# now go through each row of the original df
for i in range(original_df.shape[0]):
    row_series = original_df.iloc[i]
    row_list = row_series['list']
    for item in row_list:
        # DataFrame.append returns a new frame, so assign it back
        new_df = new_df.append({'col1': row_series['col1'],
                                'col2': row_series['col2'],
                                'col3': row_series['col3'],
                                'list': item}, ignore_index=True)
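
A sketch of a faster variant of the same idea, collecting the rows in a plain list and building the DataFrame once at the end (calling append inside a loop re-copies the frame on every iteration, and DataFrame.append has since been removed from newer pandas):

import pandas as pd

# assuming original_df is the dataframe built in the question
rows = []
for _, row in original_df.iterrows():
    for item in row['list']:
        rows.append({'col1': row['col1'], 'col2': row['col2'],
                     'col3': row['col3'], 'list': item})

out = pd.DataFrame(rows)
print(out)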
Heapify