3

I have the following DataFrame df:

center    status     devices
1     Green      [d1, d2]
1     Green      [d5, d1, d2]
2     Green      []
3     Green      [d5, d6]

I need to unfold lists in the column devices. The goal is to group data by center and device, and then count the number of observations per group.

The expected result would be the following one:

center   device   count
1        d1       2
1        d2       2
1        d5       1
3        d5       1
3        d6       1
Tatik
  • 1,107
  • 1
  • 9
  • 17

2 Answers2

3

First flatten lists and then aggregate by DataFrameGroupBy.size:

#create Series
s = df.set_index('center')['devices']
#create DataFrame, reshape by stack and conver MultiIndex to columns
df = pd.DataFrame(s.values.tolist(), index=s.index).stack().reset_index()
df.columns= ['center','i','devices']

#aggregate count
df = df.groupby(['center','devices']).size().reset_index(name='count')
print (df)
   center device  count
0       1     d1      2
1       1     d2      2
2       1     d5      1
3       3     d5      1
4       3     d6      1

Another solution for improve performance:

from itertools import chain

df = pd.DataFrame({
    'devices' : list(chain.from_iterable(df['devices'].tolist())), 
    'center' : df['center'].values.repeat(df['devices'].str.len())
})


df = df.groupby(['center','devices']).size().reset_index(name='count')
print (df)
   center devices  count
0       1      d1      2
1       1      d2      2
2       1      d5      1
3       3      d5      1
4       3      d6      1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I tried this approach on my real data, but in `center` column I can only see one value. If I do `output[output["center"]==2]`, then the output is empty. But if I do `df[df["center"]==2]`, then I get rows. Do you know why it may happen? – Tatik Mar 27 '19 at 15:33
  • 1
    @Tatik - hmmm, here is empty list for group `2`, so this row is removed from data in output. – jezrael Mar 27 '19 at 15:35
2

Using unnesting after filter the empty list out then groupby size

unnesting(df[df.devices.astype(bool)],['devices']).groupby(['center','devices']).size().reset_index(name='count')
Out[214]: 
   center devices  count
0       1      d1      2
1       1      d2      2
2       1      d5      1
3       3      d5      1
4       3      d6      1

def unnesting(df, explode):
    idx=df.index.repeat(df[explode[0]].str.len())
    df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
    df1.index=idx
    return df1.join(df.drop(explode,1),how='left')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
BENY
  • 317,841
  • 20
  • 164
  • 234
  • @jezrael yep , you are right ~ thank you for helping me :-) – BENY Mar 27 '19 at 14:44
  • 1
    @jezrael when I notice there is empty list , I guess , the op already excluded the `np.nan`, but this is only base on my assumption , which is not reliable – BENY Mar 27 '19 at 14:46