0

I have a df.

df=pd.DataFrame(data=[[301,301,302,303],[['a'],['b','c'],['e','f',33,'Z'],42],index=['id','foo']).T

I would like to get to a second dataframe, with only scalar values in foo. If and only if the original values was a list, I would like to spread it over multiple new rows, with the other values duplicated.

e.g from:

 id     foo
0  301     [a]
1  301  [b, c]
2  302  [e, f,33,'Z']
3  303   42

to:

 id     foo
0  301     a
1  301     b
1  301     c 
2  302     e
2  302     f
2  302     33
2  302     Z
3  303     42

In Split set values from Pandas dataframe cell over multiple rows, I have learned how to do this for one columns, but how to handle the case where df has multiple columns which need to be duplicated as id?

00__00__00
  • 4,834
  • 9
  • 41
  • 89

2 Answers2

1
df.set_index('id')['foo'].apply(pd.Series).stack().reset_index(name='foo').drop('level_1', axis=1)

Output

    id foo
0  301   a
1  301   b
2  301   c
3  302   e
4  302   f
5  302  33
6  302   Z
7  303  42

Multiple Columns Scenario

    id test            foo
0  301    1            [a]
1  301    2         [b, c]
2  302    3  [e, f, 33, Z]
3  303    4             42

using pd.set_index

df.set_index(['id','test'])['foo'].apply(pd.Series).stack().reset_index(name='foo').drop('level_2', axis=1)

Output

    id  test foo
0  301     1   a
1  301     2   b
2  301     2   c
3  302     3   e
4  302     3   f
5  302     3  33
6  302     3   Z
7  303     4  42
iamklaus
  • 3,720
  • 2
  • 12
  • 21
1

If you want avoid use apply(pd.Series) because slow, here is another solution - convert non lists values to one element list first and then apply solution:

df['foo']  = [x if isinstance(x, list) else [x] for x in df['foo']]

from itertools import chain

df = pd.DataFrame({
    'id' : df['id'].values.repeat(df['foo'].str.len()),
    'foo' : list(chain.from_iterable(df['foo'].tolist()))

})

Or:

L  = [x if isinstance(x, list) else [x] for x in df['foo']]

from itertools import chain

df = pd.DataFrame({
    'id' : df['id'].values.repeat([len(x) for x in L]),
    'foo' : list(chain.from_iterable(L))

})
print (df)
    id foo
0  301   a
1  301   b
2  301   c
3  302   e
4  302   f
5  302  33
6  302   Z
7  303  42

If small data or performance is not important - solution with pop for extract column foo:

s = df.pop('foo').apply(pd.Series).stack().reset_index(level=1, drop=True).rename('foo')
df = df.join(s).reset_index(drop=True)

Or solution with drop:

s = df['foo'].apply(pd.Series).stack().reset_index(level=1, drop=True).rename('foo')
df = df.drop('foo', axis=1).join(s).reset_index(drop=True)

print (df)

    id foo
0  301   a
1  301   b
2  301   c
3  302   e
4  302   f
5  302  33
6  302   Z
7  303  42

df=pd.DataFrame(data=[[301,301,302,303],[['a'],['b','c'],['e','f',33,'Z'],42]],index=['id','foo']).T

df = pd.concat([df] * 1000, ignore_index=True)

def f(df):
    s = df['foo'].apply(pd.Series).stack().reset_index(level=1, drop=True).rename('foo')
    return df.drop('foo', axis=1).join(s).reset_index(drop=True)


In [241]: %timeit (f(df))
814 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [242]: %%timeit
     ...: L  = [x if isinstance(x, list) else [x] for x in df['foo']]
     ...: 
     ...: from itertools import chain
     ...: 
     ...: pd.DataFrame({
     ...:     'id' : df['id'].values.repeat([len(x) for x in L]),
     ...:     'foo' : list(chain.from_iterable(L))
     ...: 
     ...: })
     ...: 
2.6 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252