10

I need to insert missing category for each group, here is an example:

import pandas as pd
import numpy as np

df = pd.DataFrame({ "group":[1,1,1 ,2,2],
                   "cat": ['a', 'b', 'c', 'a', 'c'] ,
                   "value": range(5),
                   "value2": np.array(range(5))* 2})

df

# test dataframe

cat group   value value2
a   1         0   0
b   1         1    2
c   1         2    4
a   2         3    6
c   2         4    8

say I have some categories = ['a', 'b', 'c', 'd']. if cat column does not contain a category from the list, I would like to insert a row, for each group with value 0. how to insert a row per group if category, so as to get all the categories for each group

cat group   value  value2
a   1         0    0
b   1         1    2
c   1         2    4
d   1         0    0
a   2         3    6
c   2         4    8
b   2         0    0
d   2         0    0
muon
  • 12,821
  • 11
  • 69
  • 88
  • Possible Duplicate: [link](https://stackoverflow.com/questions/44599589/inserting-new-rows-in-pandas-data-frame-at-specific-indices) – Gumboy Nov 05 '17 at 04:02
  • 1
    Please check timings in my answer, `groupby` here is really not necessary. – jezrael Nov 05 '17 at 07:17
  • 1
    What about `mux = pd.MultiIndex.from_product([df['group'].unique(), df['group1'].unique(), categories], names=('group','group1','cat'))` and then `df.set_index(['group','group1','cat']).reindex(mux, fill_value=0)` ? – jezrael Nov 05 '17 at 20:52

6 Answers6

5

groupby here is not neccesary, only need reindex by MultiIndex:

categories = ['a', 'b', 'c', 'd']
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()
print (df)
  cat  group  value  value2
0   a      1      0       0
1   b      1      1       2
2   c      1      2       4
3   d      1      0       0
4   a      2      3       6
5   b      2      0       0
6   c      2      4       8
7   d      2      0       0

There is a lot of solutions, so I add timings:

np.random.seed(123)
N = 1000000
L = list('abcd') #235,94.1,156ms

df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002,0.002,0.005, 0.991)),
                   'group':np.random.randint(10000,size=N),
                   'value':np.random.randint(1000,size=N),
                   'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['group','cat']).drop_duplicates(['group','cat']).reset_index(drop=True)
print (df.head(10))

categories = ['a', 'b', 'c', 'd']

def jez(df):

    mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
    return df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()

def f(x):
    return x.reindex(categories, fill_value=0).assign(group=x['group'][0].item())

def coldspeed(df):
    return df.set_index('cat').groupby('group', group_keys=False).apply(f).reset_index()    

def zero(df):
    from itertools import product
    dfo = pd.DataFrame(list(product(df['group'].unique(), categories)),
                            columns=['group', 'cat'])
    return dfo.merge(df, how='left').fillna(0)

def wen(df):
    return df.groupby('group',as_index=False).apply(lambda x : x.set_index('cat').reindex(categories)).fillna(0).reset_index().drop('level_0',1)

def bharath(df):
    mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
    return mux.to_frame().merge(df,on=['cat','group'],how='outer').fillna(0)

def akilat90(df):
    grouped = df.groupby('group')
    categories =  pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
    merged_list = []
    for g in grouped:
        merged = pd.merge(categories, g[1], how = 'outer', on='cat')
        merged['group'].fillna(merged['group'].mode()[0],inplace=True) # replace the `group` column's `NA`s by mode
        merged.fillna(0, inplace=True)
        merged_list.append(merged)

    return pd.concat(merged_list)


print (jez(df))
print (coldspeed(df))
print (zero(df))
print (wen(df))
print (bharath(df))
print (akilat90(df))

In [262]: %timeit (jez(df))
100 loops, best of 3: 11.5 ms per loop

In [263]: %timeit (bharath(df))
100 loops, best of 3: 16 ms per loop

In [264]: %timeit (zero(df))
10 loops, best of 3: 28.3 ms per loop

In [265]: %timeit (wen(df))
1 loop, best of 3: 8.74 s per loop

In [266]: %timeit (coldspeed(df))
1 loop, best of 3: 8.2 s per loop

In [297]: %timeit (akilat90(df))
1 loop, best of 3: 23.6 s per loop
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

A bit complicated, but you can use groupby + reindex:

categories = ['a', 'b', 'c', 'd']

def f(x):
    return x.reindex(categories, fill_value=0)\
                   .assign(group=x['group'][0].item())

df.set_index('cat').groupby('group', group_keys=False).apply(f).reset_index()


  cat  group  value  value2
0   a      1      0       0
1   b      1      1       2
2   c      1      2       4
3   d      1      0       0
4   a      2      3       6
5   b      2      0       0
6   c      2      4       8
7   d      2      0       0
cs95
  • 379,657
  • 97
  • 704
  • 746
3

This is one line solution ...

df.groupby('group',as_index=False).apply(lambda x : x.set_index('cat').\
       reindex(categories)).fillna(0).reset_index().drop('level_0',1)
Out[601]: 
  cat  group  value  value2
0   a    1.0    0.0     0.0
1   b    1.0    1.0     2.0
2   c    1.0    2.0     4.0
3   d    0.0    0.0     0.0
4   a    2.0    3.0     6.0
5   b    0.0    0.0     0.0
6   c    2.0    4.0     8.0
7   d    0.0    0.0     0.0
BENY
  • 317,841
  • 20
  • 164
  • 234
2

We can also use multi index as suggested by @jezreal then merge the data which significantly a very faster solution i.e

mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
ndf = mux.to_frame().merge(df,on=['cat','group'],how='outer').fillna(0)

Output :

 cat  group  value  value2
0   a      1    0.0     0.0
1   b      1    1.0     2.0
2   c      1    2.0     4.0
3   d      1    0.0     0.0
4   a      2    3.0     6.0
5   b      2    0.0     0.0
6   c      2    4.0     8.0
7   d      2    0.0     0.0
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
1

Use merge on precomputed combinations of cat, group

In [35]: from itertools import product

In [36]: cats = ['a', 'b', 'c', 'd']

In [37]: dfo = pd.DataFrame(list(product(df['group'].unique(), cats)),
                            columns=['group', 'cat'])

In [38]: dfo.merge(df, how='left').fillna(0)
Out[38]:
   group cat  value  value2
0      1   a    0.0     0.0
1      1   b    1.0     2.0
2      1   c    2.0     4.0
3      1   d    0.0     0.0
4      2   a    3.0     6.0
5      2   b    0.0     0.0
6      2   c    4.0     8.0
7      2   d    0.0     0.0
Zero
  • 74,117
  • 18
  • 147
  • 154
0

Not an elegant way to do this; I wish if I knew a way to merge at group level so that the for loop can be eliminated. That discussion is here


Solution

Think of categories list as a dataframe and do merging at a group level after a group by.

categories =  pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
print(categories)

grouped = df.groupby('group')

This is the ugly part. I wonder if there's a pandas way to eliminate this for loop:

merged_list = []
for g in grouped:
    merged = pd.merge(categories, g[1], how = 'outer', on='cat')
    merged['group'].fillna(merged['group'].mode()[0],inplace=True) # replace the `group` column's `NA`s by mode
    merged.fillna(0, inplace=True)
    merged_list.append(merged)
    print(merged)

  cat  group  value  value2
0   a    1.0    0.0     0.0
1   b    1.0    1.0     2.0
2   c    1.0    2.0     4.0
3   d    1.0    0.0     0.0
  cat  group  value  value2
0   a    2.0    3.0     6.0
1   b    2.0    0.0     0.0
2   c    2.0    4.0     8.0
3   d    2.0    0.0     0.0

Then we can just concat the merged_list

out = pd.concat(merged_list)
print(out)


  cat  group  value  value2
0   a    1.0    0.0     0.0
1   b    1.0    1.0     2.0
2   c    1.0    2.0     4.0
3   d    1.0    0.0     0.0
0   a    2.0    3.0     6.0
1   b    2.0    0.0     0.0
2   c    2.0    4.0     8.0
3   d    2.0    0.0     0.0
akilat90
  • 5,436
  • 7
  • 28
  • 42
  • Unfortunately I cannot add your solution to my timings, because I get error :( – jezrael Nov 05 '17 at 06:44
  • @akilat90 - I get it, I forget conver `categories` to `DataFrame`. Your solution was added to timings. – jezrael Nov 05 '17 at 07:05
  • Thanks @jezrael. The for loop + groupby merge is terrible :( – akilat90 Nov 05 '17 at 07:09
  • I am afraid it is True :( – jezrael Nov 05 '17 at 07:09
  • Sorry, my last comment wasn't clear. See my [answer to your question here](https://stackoverflow.com/a/47118758/4909087) on how to spruce this code, you can do this in a single line. – cs95 Nov 05 '17 at 17:14
  • @coldspeed thanks for the nice pythonic one liner! But, that was a little trimmed down version of this. Notice that I use an overcomplicated `mode` method to fill the `group` column here? That is not included in the other question. I wonder if it's still possible to write by a list comprehension. – akilat90 Nov 05 '17 at 17:22